PHD FINAL PROJECT

Problem Statement

'India Broadband' is a well-established company in the broadband space. Owing to intense competition, it faces a major problem of customer churn and dissatisfaction caused by broadband outages. The company has curated a dataset tracking several variables it believes influence 'outage_duration'. Three outage duration classes are tracked: '0' for no outage, '1' for short outages lasting from a few minutes up to a maximum of 2 hours, and '2' for long outages lasting from 2 hours up to sometimes even a couple of days. We use the metrics the company has tracked to build a machine learning model that predicts 'outage_duration' from the quantitative and qualitative features.

Predict 'outage_duration' to help the company handle outages better, improve customer satisfaction and thereby reduce customer churn.

Target attribute: "outage_duration", where: 0=no outage, 1=short outages (between a few minutes and a maximum of 2 hours), 2=long outages (from 2 hours to sometimes even a couple of days)

This is a classification problem (3 classes = 0,1,2)

The evaluation metric used is 'F1 Macro Average'

Mounting Google Drive to access files:

from google.colab import drive
drive.mount('/gdrive')

Importing the required libraries

In [1]:
import os
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

from sklearn import preprocessing
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, RandomizedSearchCV

from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

from sklearn import metrics
from sklearn.metrics import f1_score
In [2]:
import matplotlib.pyplot as plt
from matplotlib import pyplot
import seaborn as sns 
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot, plot
from scipy.stats.mstats import mode

%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

Defining the evaluation metric: F1 Macro Average

In [3]:
def accuracy_precision_recall_f1_metrics(y_true, y_pred):
    
    # The target labels are already integer-encoded (0/1/2), so they can be
    # passed to sklearn's f1_score directly.
    f1 = f1_score(y_true=y_true, y_pred=y_pred, average='macro')
    
    print("Test F1 Score:", f1)
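As a quick sanity check on the metric choice: macro averaging computes F1 per class and then takes the unweighted mean, so the rare long-outage class counts as much as the majority no-outage class. A minimal example on toy labels:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

# Per-class F1: class 0 -> 2/3, class 1 -> 0.8, class 2 -> 0.0 (never predicted).
# Macro average = (2/3 + 0.8 + 0.0) / 3 ~= 0.489
macro = f1_score(y_true, y_pred, average='macro')
print(macro)
```

Note how completely missing class 2 drags the score down by a full third, which is exactly the behaviour we want for this imbalanced target.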

Reading & Loading the data

Loading data to run on the Kaggle cloud:

print(os.listdir('../input'))

train_data_1 = pd.read_csv("../input/train_data.csv")
broadband_1 = pd.read_csv("../input/broadband_data.csv")
outage_1 = pd.read_csv("../input/outage_data.csv")
report_1 = pd.read_csv("../input/report_data.csv")
server_1 = pd.read_csv("../input/server_data.csv")
test_data_1 = pd.read_csv("../input/test_data-1593798292529.csv")

A) For EDA and Data Visualization

In [4]:
train_data_1 = pd.read_csv("train_data.csv")
broadband_1 = pd.read_csv("broadband_data.csv")
outage_1 = pd.read_csv("outage_data.csv")
report_1 = pd.read_csv("report_data.csv")
server_1 = pd.read_csv("server_data.csv")
sample = pd.read_csv("sample_submission.csv")
test_data_1 = pd.read_csv("test_data-1593798292529.csv")
In [5]:
train_data_1.head(2)
Out[5]:
id area_code outage_duration
0 13366 area_415 1
1 6783 area_474 0
In [6]:
test_data_1.head(2)
Out[6]:
id area_code
0 3340 area_344
1 14067 area_933
In [7]:
print(train_data_1.shape)
print(broadband_1.shape)
print(outage_1.shape)
print(report_1.shape)
print(server_1.shape)
print(sample.shape)
print(test_data_1.shape)
(5904, 3)
(21076, 2)
(18552, 2)
(58671, 3)
(31170, 2)
(1477, 2)
(1477, 2)

Checking the number of unique 'id' values in each dataset to spot duplicates and decide how the datasets should be merged.

In [8]:
print(len(train_data_1.id.unique()))
print(len(broadband_1.id.unique()))
print(len(outage_1.id.unique()))
print(len(report_1.id.unique()))
print(len(server_1.id.unique()))
print(len(test_data_1.id.unique()))
5904
18552
18552
18552
18552
1477

Since broadband (21076 rows), report (58671 rows), and server (31170 rows) each contain only 18552 unique ids, duplicate 'id' values exist in those three datasets; outage has exactly one row per id.

Merging the datasets

The datasets are merged on the 'id' column; for datasets with duplicated ids, a per-id occurrence counter ('g', from groupby().cumcount()) is added on both sides so that the k-th occurrence of an id in one frame is paired with the k-th occurrence in the other, in four steps:
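The groupby-cumcount trick used below pairs up repeated ids one-to-one instead of producing a Cartesian blow-up of all duplicate combinations. A toy illustration (hypothetical ids and values):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 1, 2], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'id': [1, 1, 2], 'b': [10, 20, 30]})

# cumcount() numbers repeated ids 0, 1, 2, ... within each id group
left['g'] = left.groupby('id').cumcount()
right['g'] = right.groupby('id').cumcount()

# Merging on (id, g) pairs the k-th occurrences; a plain merge on 'id'
# alone would give 2x2 + 1 = 5 rows, this keyed merge keeps exactly 3.
merged = pd.merge(left, right, on=['id', 'g'], how='left').drop('g', axis=1)
print(merged)
```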

1) train/test and outage

Checking duplicates in outage

In [9]:
outage_1.duplicated("id").sum()
Out[9]:
0
In [10]:
train_data_1['g'] = train_data_1.groupby('id').cumcount()
outage_1['g'] = outage_1.groupby('id').cumcount()

result = pd.merge(train_data_1,outage_1,on=["id", 'g'],how='left').drop('g', axis=1)
print (result)
         id  area_code  outage_duration    outage_type
0     13366   area_415                1  outage_type_4
1      6783   area_474                0  outage_type_2
2      9519   area_931                1  outage_type_2
3     10202   area_700                1  outage_type_1
4      4555   area_600                2  outage_type_2
...     ...        ...              ...            ...
5899   1910   area_403                1  outage_type_2
5900  10475   area_821                0  outage_type_1
5901  10675   area_798                2  outage_type_2
5902  14714   area_210                0  outage_type_2
5903   6253  area_1007                0  outage_type_2

[5904 rows x 4 columns]
In [11]:
test_data_1['g'] = test_data_1.groupby('id').cumcount()
outage_1['g'] = outage_1.groupby('id').cumcount()

test1 = pd.merge(test_data_1,outage_1,on=["id", 'g'],how='left').drop('g', axis=1)
print (test1)
         id  area_code    outage_type
0      3340   area_344  outage_type_4
1     14067   area_933  outage_type_1
2      1134    area_16  outage_type_1
3        27   area_793  outage_type_1
4      9793   area_344  outage_type_2
...     ...        ...            ...
1472   7975   area_124  outage_type_2
1473  10664   area_821  outage_type_1
1474   7753   area_476  outage_type_2
1475    989    area_38  outage_type_4
1476   3129  area_1067  outage_type_1

[1477 rows x 3 columns]
In [12]:
result.head(2)
Out[12]:
id area_code outage_duration outage_type
0 13366 area_415 1 outage_type_4
1 6783 area_474 0 outage_type_2
In [13]:
test1.head(2)
Out[13]:
id area_code outage_type
0 3340 area_344 outage_type_4
1 14067 area_933 outage_type_1

2) result/test1 and broadband

Checking shape of broadband

In [14]:
broadband_1.shape
Out[14]:
(21076, 2)

Checking duplicates of broadband

In [15]:
broadband_1.duplicated("id").sum()
Out[15]:
2524
In [16]:
result['g'] = result.groupby('id').cumcount()
broadband_1['g'] = broadband_1.groupby('id').cumcount()

result2 = pd.merge(result,broadband_1,on=["id", 'g'],how='left').drop('g', axis=1)
print (result2)
         id  area_code  outage_duration    outage_type    broadband_type
0     13366   area_415                1  outage_type_4  broadband_type_2
1      6783   area_474                0  outage_type_2  broadband_type_2
2      9519   area_931                1  outage_type_2  broadband_type_8
3     10202   area_700                1  outage_type_1  broadband_type_8
4      4555   area_600                2  outage_type_2  broadband_type_8
...     ...        ...              ...            ...               ...
5899   1910   area_403                1  outage_type_2  broadband_type_2
5900  10475   area_821                0  outage_type_1  broadband_type_8
5901  10675   area_798                2  outage_type_2  broadband_type_8
5902  14714   area_210                0  outage_type_2  broadband_type_2
5903   6253  area_1007                0  outage_type_2  broadband_type_8

[5904 rows x 5 columns]
In [17]:
test1['g'] = test1.groupby('id').cumcount()
broadband_1['g'] = broadband_1.groupby('id').cumcount()

test2 = pd.merge(test1,broadband_1,on=["id", 'g'],how='left').drop('g', axis=1)
print (test2)
         id  area_code    outage_type    broadband_type
0      3340   area_344  outage_type_4  broadband_type_2
1     14067   area_933  outage_type_1  broadband_type_2
2      1134    area_16  outage_type_1  broadband_type_8
3        27   area_793  outage_type_1  broadband_type_8
4      9793   area_344  outage_type_2  broadband_type_2
...     ...        ...            ...               ...
1472   7975   area_124  outage_type_2  broadband_type_2
1473  10664   area_821  outage_type_1  broadband_type_8
1474   7753   area_476  outage_type_2  broadband_type_2
1475    989    area_38  outage_type_4  broadband_type_9
1476   3129  area_1067  outage_type_1  broadband_type_8

[1477 rows x 4 columns]
In [18]:
result2.head(2)
Out[18]:
id area_code outage_duration outage_type broadband_type
0 13366 area_415 1 outage_type_4 broadband_type_2
1 6783 area_474 0 outage_type_2 broadband_type_2
In [19]:
test2.head(2)
Out[19]:
id area_code outage_type broadband_type
0 3340 area_344 outage_type_4 broadband_type_2
1 14067 area_933 outage_type_1 broadband_type_2

3) result2/test2 and report

Checking shape of report

In [20]:
report_1.shape
Out[20]:
(58671, 3)

Checking duplicates of report

In [21]:
report_1.duplicated("id").sum()
Out[21]:
40119
In [22]:
result2['g'] = result2.groupby('id').cumcount()
report_1['g'] = report_1.groupby('id').cumcount()

result3 = pd.merge(result2,report_1,on=["id", 'g'],how='left').drop('g', axis=1)
print (result3)
         id  area_code  outage_duration    outage_type    broadband_type  \
0     13366   area_415                1  outage_type_4  broadband_type_2   
1      6783   area_474                0  outage_type_2  broadband_type_2   
2      9519   area_931                1  outage_type_2  broadband_type_8   
3     10202   area_700                1  outage_type_1  broadband_type_8   
4      4555   area_600                2  outage_type_2  broadband_type_8   
...     ...        ...              ...            ...               ...   
5899   1910   area_403                1  outage_type_2  broadband_type_2   
5900  10475   area_821                0  outage_type_1  broadband_type_8   
5901  10675   area_798                2  outage_type_2  broadband_type_8   
5902  14714   area_210                0  outage_type_2  broadband_type_2   
5903   6253  area_1007                0  outage_type_2  broadband_type_8   

          log_report_type  volume  
0     log_report_type_312       1  
1     log_report_type_312       2  
2     log_report_type_203       3  
3      log_report_type_54       1  
4      log_report_type_82      30  
...                   ...     ...  
5899  log_report_type_232       1  
5900   log_report_type_74       4  
5901  log_report_type_203       3  
5902  log_report_type_232       2  
5903  log_report_type_201       6  

[5904 rows x 7 columns]
In [23]:
test2['g'] = test2.groupby('id').cumcount()
report_1['g'] = report_1.groupby('id').cumcount()

test3 = pd.merge(test2,report_1,on=["id", 'g'],how='left').drop('g', axis=1)
print (test3)
         id  area_code    outage_type    broadband_type      log_report_type  \
0      3340   area_344  outage_type_4  broadband_type_2  log_report_type_312   
1     14067   area_933  outage_type_1  broadband_type_2  log_report_type_134   
2      1134    area_16  outage_type_1  broadband_type_8  log_report_type_155   
3        27   area_793  outage_type_1  broadband_type_8   log_report_type_73   
4      9793   area_344  outage_type_2  broadband_type_2  log_report_type_312   
...     ...        ...            ...               ...                  ...   
1472   7975   area_124  outage_type_2  broadband_type_2  log_report_type_312   
1473  10664   area_821  outage_type_1  broadband_type_8   log_report_type_80   
1474   7753   area_476  outage_type_2  broadband_type_2  log_report_type_234   
1475    989    area_38  outage_type_4  broadband_type_9  log_report_type_240   
1476   3129  area_1067  outage_type_1  broadband_type_8   log_report_type_54   

      volume  
0          2  
1          1  
2         10  
3          3  
4         12  
...      ...  
1472       1  
1473       1  
1474       1  
1475       2  
1476       8  

[1477 rows x 6 columns]
In [24]:
result3.head(2)
Out[24]:
id area_code outage_duration outage_type broadband_type log_report_type volume
0 13366 area_415 1 outage_type_4 broadband_type_2 log_report_type_312 1
1 6783 area_474 0 outage_type_2 broadband_type_2 log_report_type_312 2
In [25]:
test3.head(2)
Out[25]:
id area_code outage_type broadband_type log_report_type volume
0 3340 area_344 outage_type_4 broadband_type_2 log_report_type_312 2
1 14067 area_933 outage_type_1 broadband_type_2 log_report_type_134 1

4) result3/test3 and server

Checking shape of server

In [26]:
server_1.shape
Out[26]:
(31170, 2)

Checking duplicates of server

In [27]:
server_1.duplicated("id").sum()
Out[27]:
12618
In [28]:
result3['g'] = result3.groupby('id').cumcount()
server_1['g'] = server_1.groupby('id').cumcount()

train_1 = pd.merge(result3,server_1,on=["id", 'g'],how='left').drop('g', axis=1)
print (train_1)
         id  area_code  outage_duration    outage_type    broadband_type  \
0     13366   area_415                1  outage_type_4  broadband_type_2   
1      6783   area_474                0  outage_type_2  broadband_type_2   
2      9519   area_931                1  outage_type_2  broadband_type_8   
3     10202   area_700                1  outage_type_1  broadband_type_8   
4      4555   area_600                2  outage_type_2  broadband_type_8   
...     ...        ...              ...            ...               ...   
5899   1910   area_403                1  outage_type_2  broadband_type_2   
5900  10475   area_821                0  outage_type_1  broadband_type_8   
5901  10675   area_798                2  outage_type_2  broadband_type_8   
5902  14714   area_210                0  outage_type_2  broadband_type_2   
5903   6253  area_1007                0  outage_type_2  broadband_type_8   

          log_report_type  volume     transit_server_type  
0     log_report_type_312       1  transit_server_type_35  
1     log_report_type_312       2  transit_server_type_35  
2     log_report_type_203       3  transit_server_type_15  
3      log_report_type_54       1  transit_server_type_11  
4      log_report_type_82      30  transit_server_type_15  
...                   ...     ...                     ...  
5899  log_report_type_232       1  transit_server_type_34  
5900   log_report_type_74       4  transit_server_type_15  
5901  log_report_type_203       3  transit_server_type_15  
5902  log_report_type_232       2  transit_server_type_35  
5903  log_report_type_201       6  transit_server_type_15  

[5904 rows x 8 columns]
In [29]:
test3['g'] = test3.groupby('id').cumcount()
server_1['g'] = server_1.groupby('id').cumcount()

test_1 = pd.merge(test3,server_1,on=["id", 'g'],how='left').drop('g', axis=1)
print (test_1)
         id  area_code    outage_type    broadband_type      log_report_type  \
0      3340   area_344  outage_type_4  broadband_type_2  log_report_type_312   
1     14067   area_933  outage_type_1  broadband_type_2  log_report_type_134   
2      1134    area_16  outage_type_1  broadband_type_8  log_report_type_155   
3        27   area_793  outage_type_1  broadband_type_8   log_report_type_73   
4      9793   area_344  outage_type_2  broadband_type_2  log_report_type_312   
...     ...        ...            ...               ...                  ...   
1472   7975   area_124  outage_type_2  broadband_type_2  log_report_type_312   
1473  10664   area_821  outage_type_1  broadband_type_8   log_report_type_80   
1474   7753   area_476  outage_type_2  broadband_type_2  log_report_type_234   
1475    989    area_38  outage_type_4  broadband_type_9  log_report_type_240   
1476   3129  area_1067  outage_type_1  broadband_type_8   log_report_type_54   

      volume     transit_server_type  
0          2  transit_server_type_35  
1          1  transit_server_type_20  
2         10  transit_server_type_11  
3          3  transit_server_type_11  
4         12  transit_server_type_34  
...      ...                     ...  
1472       1  transit_server_type_34  
1473       1  transit_server_type_11  
1474       1  transit_server_type_35  
1475       2  transit_server_type_27  
1476       8  transit_server_type_11  

[1477 rows x 7 columns]
In [30]:
train_1.head(2)
Out[30]:
id area_code outage_duration outage_type broadband_type log_report_type volume transit_server_type
0 13366 area_415 1 outage_type_4 broadband_type_2 log_report_type_312 1 transit_server_type_35
1 6783 area_474 0 outage_type_2 broadband_type_2 log_report_type_312 2 transit_server_type_35
In [31]:
test_1.head(2)
Out[31]:
id area_code outage_type broadband_type log_report_type volume transit_server_type
0 3340 area_344 outage_type_4 broadband_type_2 log_report_type_312 2 transit_server_type_35
1 14067 area_933 outage_type_1 broadband_type_2 log_report_type_134 1 transit_server_type_20
In [32]:
train_1.shape, test_1.shape
Out[32]:
((5904, 8), (1477, 7))

Shape of sample submission file

In [33]:
sample.shape
Out[33]:
(1477, 2)

Overview of the data

In [34]:
#!pip install pandas-profiling[notebook,html]
In [35]:
import pandas_profiling
In [36]:
pandas_profiling.ProfileReport(train_1)
Out[36]:

We get a detailed report and insights on the entire dataset including the features - categorical & numerical, missing values, correlation between different numerical features, unique/distinct values, and a sample of the dataset (first few & last few rows).

Understanding the data

In [37]:
train_1.dtypes
Out[37]:
id                      int64
area_code              object
outage_duration         int64
outage_type            object
broadband_type         object
log_report_type        object
volume                  int64
transit_server_type    object
dtype: object
In [38]:
train_1.describe()
Out[38]:
id outage_duration volume
count 5904.000000 5904.000000 5904.000000
mean 9323.841972 0.450203 8.528455
std 5338.654942 0.666642 22.752851
min 1.000000 0.000000 1.000000
25% 4763.500000 0.000000 1.000000
50% 9330.500000 0.000000 2.000000
75% 13932.750000 1.000000 7.000000
max 18550.000000 2.000000 636.000000
In [39]:
train_1.describe(include='object')
Out[39]:
area_code outage_type broadband_type log_report_type transit_server_type
count 5904 5904 5904 5904 5904
unique 876 5 9 213 44
top area_821 outage_type_2 broadband_type_8 log_report_type_312 transit_server_type_11
freq 67 2875 2882 580 1820
In [40]:
train_1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5904 entries, 0 to 5903
Data columns (total 8 columns):
id                     5904 non-null int64
area_code              5904 non-null object
outage_duration        5904 non-null int64
outage_type            5904 non-null object
broadband_type         5904 non-null object
log_report_type        5904 non-null object
volume                 5904 non-null int64
transit_server_type    5904 non-null object
dtypes: int64(3), object(5)
memory usage: 575.1+ KB

Checking for missing values

In [41]:
train_1.isnull().sum()
Out[41]:
id                     0
area_code              0
outage_duration        0
outage_type            0
broadband_type         0
log_report_type        0
volume                 0
transit_server_type    0
dtype: int64

There are no missing values in the train data.

Conversion of object type features to categorical variables for visualization

In [42]:
for col in ['outage_type','broadband_type','log_report_type','transit_server_type','area_code']:
    train_1[col] = train_1[col].astype('category')
In [43]:
cat_attr = list(train_1.select_dtypes("category").columns)
num_attr = list(train_1.columns.difference(cat_attr))
In [44]:
num_attr
Out[44]:
['id', 'outage_duration', 'volume']
In [45]:
cat_attr
Out[45]:
['area_code',
 'outage_type',
 'broadband_type',
 'log_report_type',
 'transit_server_type']
In [46]:
train_1.dtypes
Out[46]:
id                        int64
area_code              category
outage_duration           int64
outage_type            category
broadband_type         category
log_report_type        category
volume                    int64
transit_server_type    category
dtype: object

Exploratory Data Analysis & Visualization

Understanding the distribution of dependent variable "outage_duration"

In [47]:
sns.countplot(x='outage_duration',data = train_1)
plt.show()

train_1.outage_duration.value_counts(normalize = True)*100
Out[47]:
0    64.820461
1    25.338753
2     9.840786
Name: outage_duration, dtype: float64

As per the above visualization, the target is imbalanced: 64.8% of records have no outage (class 0), 25.3% have short outages (class 1), and only 9.8% have long outages (class 2).
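Given this imbalance (class 2 is under 10% of rows), any later train/validation split should preserve the class proportions. A minimal sketch using the `stratify` parameter of `train_test_split` (illustrative data, not the project split itself):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target mimicking the ~65/25/10 class mix observed above
y = np.array([0] * 65 + [1] * 25 + [2] * 10)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y splits each class proportionally into train and validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(np.bincount(y_tr), np.bincount(y_val))
```

Without stratification, a random 20% split could easily contain almost no class-2 examples, making the macro F1 estimate unstable.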

Univariate analysis

[I] Creation of a bar chart with counts for categorical variables in plotly

In [48]:
def generate_layout_bar(col_name):
    layout_bar = go.Layout(
        autosize=False, # auto size the graph? use False if you are specifying the height and width
        width=800, # width of the figure in pixels
        height=600, # height of the figure in pixels
        title = "Distribution of {} column".format(col_name), # title of the figure
        # more granular control on the title font 
        titlefont=dict( 
            family='Courier New, monospace', # font family
            size=14, # size of the font
            color='black' # color of the font
        ),
        # granular control on the axes objects 
        xaxis=dict( 
        tickfont=dict(
            family='Courier New, monospace', # font family
            size=14, # size of ticks displayed on the x axis
            color='black'  # color of the font
            )
        ),
        yaxis=dict(
            title='Percentage',
            titlefont=dict(
                size=14,
                color='black'
            ),
        tickfont=dict(
            family='Courier New, monospace', # font family
            size=14, # size of ticks displayed on the y axis
            color='black' # color of the font
            )
        ),
        font = dict(
            family='Courier New, monospace', # font family
            color = "white",# color of the font
            size = 12 # size of the font displayed on the bar
                )  
        )
    return layout_bar
In [49]:
def plot_bar(col_name):
    # create a table with value counts
    temp = train_1[col_name].value_counts()
    # creating a Bar chart object of plotly
    data = [go.Bar(
            x=temp.index.astype(str), # x axis values
            y=np.round(temp.values.astype(float)/temp.values.sum(),4)*100, # y axis values
            text = ['{}%'.format(i) for i in np.round(temp.values.astype(float)/temp.values.sum(),4)*100],
        # text to be displayed on the bar, we are doing this to display the '%' symbol along with the number on the bar
            textposition = 'auto', # specify at which position on the bar the text should appear
        marker = dict(color = '#0047AB'),)] # change color of the bar
    # color used here Cobalt Blue
     
    layout_bar = generate_layout_bar(col_name=col_name)

    fig = go.Figure(data=data, layout=layout_bar)
    return iplot(fig)

1) Distribution of area_code

In [50]:
plot_bar('area_code')

2) Distribution of broadband_type

In [51]:
plot_bar('broadband_type')

3) Distribution of outage_type

In [52]:
plot_bar('outage_type')

4) Distribution of log_report_type

In [53]:
plot_bar('log_report_type')

5) Distribution of transit_server_type

In [54]:
plot_bar('transit_server_type')

[II] Creating plots for numerical variables

Distribution of volume

In [55]:
def summary(x):
    x_min = train_1[x].min()
    x_max = train_1[x].max()
    Q1 = train_1[x].quantile(0.25)
    Q2 = train_1[x].quantile(0.50)
    Q3 = train_1[x].quantile(0.75)
    print(f'5 Point Summary of {x.capitalize()} Attribute:\n'
          f'{x.capitalize()}(min) : {x_min}\n'
          f'Q1                    : {Q1}\n'
          f'Q2(Median)            : {Q2}\n'
          f'Q3                    : {Q3}\n'
          f'{x.capitalize()}(max) : {x_max}')

    fig = plt.figure(figsize=(16, 10))
    plt.subplots_adjust(hspace = 0.6)
    sns.set_palette('pastel')
    
    plt.subplot(221)
    ax1 = sns.distplot(train_1[x], color = 'r')
    plt.title(f'{x.capitalize()} Density Distribution')
    
    plt.subplot(222)
    ax2 = sns.violinplot(x = train_1[x], palette = 'Accent', split = True)
    plt.title(f'{x.capitalize()} Violinplot')
    
    plt.subplot(223)
    ax2 = sns.boxplot(x=train_1[x], palette = 'cool', width=0.5, linewidth=0.6)
    plt.title(f'{x.capitalize()} Boxplot')
    
    plt.subplot(224)
    ax3 = sns.kdeplot(train_1[x], cumulative=True)
    plt.title(f'{x.capitalize()} Cumulative Density Distribution')
    
    plt.show()
In [56]:
def box_plot(x = 'volume'):
    def add_values(bp, ax):
        """ This actually adds the numbers to the various points of the boxplots"""
        for element in ['whiskers', 'medians', 'caps']:
            for line in bp[element]:
                # Get the position of the element. y is the label you want
                (x_l, y),(x_r, _) = line.get_xydata()
                # Make sure datapoints exist 
                # (I've been working with intervals, should not be problem for this case)
                if not np.isnan(y): 
                    x_line_center = x_l + (x_r - x_l)/2
                    y_line_center = y  # since it's a horizontal line
                    # overlay the value:  on the line, from center to right
                    ax.text(x_line_center, y_line_center, # Position
                            '%.2f' % y, # value formatted to 2 decimal places
                            verticalalignment='center', # Centered vertically with line 
                            fontsize=12, backgroundcolor="white")

    fig, axes = plt.subplots(1, figsize=(4, 8))

    red_diamond = dict(markerfacecolor='r', marker='D')

    bp_dict = train_1.boxplot(column = x, 
                             grid=True, 
                             figsize=(4, 8), 
                             ax=axes, 
                             vert = True, 
                             notch=False, 
                             widths = 0.5, 
                             showmeans = True, 
                             whis = 1.5,
                             flierprops = red_diamond,
                             boxprops= dict(linewidth=3.0, color='black'),
                             whiskerprops=dict(linewidth=3.0, color='black'),
                             return_type = 'dict')

    add_values(bp_dict, axes)

    plt.title(f'{x.capitalize()} Boxplot', fontsize=16)
    plt.ylabel(f'{x.capitalize()}', fontsize=14)
    plt.show()
    
    skew = train_1[x].skew()
    Q1 = train_1[x].quantile(0.25)
    Q3 = train_1[x].quantile(0.75)
    IQR = Q3 - Q1
    total_outlier_num = ((train_1[x] < (Q1 - 1.5 * IQR)) | (train_1[x] > (Q3 + 1.5 * IQR))).sum()
    print(f'Mean {x.capitalize()} = {train_1[x].mean()}')
    print(f'Median {x.capitalize()} = {train_1[x].median()}')
    print(f'Skewness of {x}: {skew}.')
    print(f'Total number of outliers in {x} distribution: {total_outlier_num}.') 
In [57]:
summary('volume')
5 Point Summary of Volume Attribute:
Volume(min) : 1
Q1                    : 1.0
Q2(Median)            : 2.0
Q3                    : 7.0
Volume(max) : 636
In [58]:
box_plot('volume')
Mean Volume = 8.528455284552846
Median Volume = 2.0
Skewness of volume: 10.001654150368328.
Total number of outliers in volume distribution: 702.

The box plot helps us identify outliers and measure variability in the dataset; volume is highly right-skewed (skewness ≈ 10) with 702 IQR outliers.

Bivariate analysis

Understanding the relation between outage_duration and other categorical independent variables

(focusing on outage_duration == 2, the longest outage class)

1) area_code & outage_duration

In [59]:
count = train_1[train_1.outage_duration == 2].area_code.value_counts()
count.head(10).plot(kind="bar", figsize =(10,10))
count.head(10)
Out[59]:
area_1100    28
area_1107    27
area_600     27
area_821     24
area_734     23
area_704     19
area_1019    18
area_810     18
area_834     17
area_684     17
Name: area_code, dtype: int64

Q1 : Which areas are most prone to long outage durations?

=> Based on the EDA, the areas most prone to long outage durations are: area_1100 (28), area_1107 (27), area_600 (27), area_821 (24), and area_734 (23).

2) broadband_type & outage_duration

In [60]:
count = train_1[train_1.outage_duration == 2].broadband_type.value_counts()
count.head(10).plot(kind="bar", figsize =(10,10))
count.head(10)
Out[60]:
broadband_type_8     498
broadband_type_2      62
broadband_type_6       8
broadband_type_1       5
broadband_type_7       4
broadband_type_9       1
broadband_type_4       1
broadband_type_3       1
broadband_type_10      1
Name: broadband_type, dtype: int64

Q2 : Which broadband types are suspected of long outage durations?

=> Based on the visualization, the broadband types most associated with long outage durations are: broadband_type_8 'ADSL 1' (498), broadband_type_2 'ADSL 2' (62), broadband_type_6 'ADSL 2+' (8), broadband_type_1 'Fiber Ultra' (5), and broadband_type_7 'Cable' (4).

3) outage_type & outage_duration

In [61]:
count = train_1[train_1.outage_duration == 2].outage_type.value_counts()
count.head(10).plot(kind="bar", figsize =(5,5))
count.head(10)
Out[61]:
outage_type_1    382
outage_type_2    199
outage_type_5      0
outage_type_4      0
outage_type_3      0
Name: outage_type, dtype: int64

4) log_report_type & outage_duration

In [62]:
count = train_1[train_1.outage_duration == 2].log_report_type.value_counts()
count.head(10).plot(kind="bar", figsize =(5,5))
count.head(10)
Out[62]:
log_report_type_82     156
log_report_type_203    142
log_report_type_71      45
log_report_type_80      25
log_report_type_193     20
log_report_type_54      18
log_report_type_170     17
log_report_type_68      13
log_report_type_283     12
log_report_type_201     10
Name: log_report_type, dtype: int64

5) transit_server_type & outage_duration

In [63]:
count = train_1[train_1.outage_duration == 2].transit_server_type.value_counts()
count.head(10).plot(kind="bar", figsize =(5,5))
count.head(10)
Out[63]:
transit_server_type_15    296
transit_server_type_11    189
transit_server_type_54     21
transit_server_type_44     19
transit_server_type_13     13
transit_server_type_42      9
transit_server_type_20      8
transit_server_type_14      5
transit_server_type_43      4
transit_server_type_50      3
Name: transit_server_type, dtype: int64

Based on the above plots, the client should focus outage control solutions on:

(1) area_1100, (2) broadband_type_8 'ADSL 1', (3) outage_type_1, (4) log_report_type_82, and (5) transit_server_type_15,

as these face the most long outages, lasting from 2 hours to a couple of days.

Understanding the relation of outage_duration with numerical variables

volume & outage_duration

In [64]:
g = sns.FacetGrid(train_1, col='outage_duration')
g.map(plt.hist, 'volume', bins=20)
Out[64]:
<seaborn.axisgrid.FacetGrid at 0x12ff035d0>

As per the above visualization, most of the volume falls in the outage_duration = 0 category, less in outage_duration = 1, and the least in outage_duration = 2.

Understanding the distribution of outage_duration and other independent variables

Correlation matrix

In [65]:
corrmat = train_1.corr()

plt.subplots(figsize=(10, 5))
sns.heatmap(corrmat, vmax=.9, square=True)
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x12bd95490>

The correlation matrix only shows relation between the numerical attributes (id, volume and outage_duration). To see the complete correlation matrix we would have to convert the categorical columns to integers using label encoding or dummification and then visualize the correlation matrix again.
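One way to do that, using each category's integer codes as a quick proxy (the codes are arbitrary ordinals, so treat the resulting correlations as rough signals only). A minimal sketch on a hypothetical miniature frame standing in for train_1:

```python
import pandas as pd

# Hypothetical miniature frame with the same column types as train_1
df = pd.DataFrame({
    'outage_duration': [0, 1, 2, 1, 0],
    'outage_type': pd.Categorical(
        ['outage_type_2', 'outage_type_4', 'outage_type_1',
         'outage_type_1', 'outage_type_2']),
    'volume': [2, 1, 30, 3, 6],
})

# Replace each categorical column with its integer codes, then correlate
encoded = df.copy()
for col in encoded.select_dtypes('category').columns:
    encoded[col] = encoded[col].cat.codes
corr = encoded.corr()
print(corr)
```

With the categoricals encoded, `corr()` now covers all columns, so the heatmap above could be redrawn over the full feature set.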

In [66]:
train_1.dtypes
Out[66]:
id                        int64
area_code              category
outage_duration           int64
outage_type            category
broadband_type         category
log_report_type        category
volume                    int64
transit_server_type    category
dtype: object
In [67]:
train_1.isnull().sum()
Out[67]:
id                     0
area_code              0
outage_duration        0
outage_type            0
broadband_type         0
log_report_type        0
volume                 0
transit_server_type    0
dtype: int64

Dummification

In [68]:
train_1 = pd.get_dummies(columns=['outage_type','broadband_type'], data=train_1)
In [69]:
test_1 = pd.get_dummies(columns=['outage_type','broadband_type'], data = test_1)
In [70]:
train_1.head(2)
Out[70]:
id area_code outage_duration log_report_type volume transit_server_type outage_type_outage_type_1 outage_type_outage_type_2 outage_type_outage_type_3 outage_type_outage_type_4 outage_type_outage_type_5 broadband_type_broadband_type_1 broadband_type_broadband_type_10 broadband_type_broadband_type_2 broadband_type_broadband_type_3 broadband_type_broadband_type_4 broadband_type_broadband_type_6 broadband_type_broadband_type_7 broadband_type_broadband_type_8 broadband_type_broadband_type_9
0 13366 area_415 1 log_report_type_312 1 transit_server_type_35 0 0 0 1 0 0 0 1 0 0 0 0 0 0
1 6783 area_474 0 log_report_type_312 2 transit_server_type_35 0 1 0 0 0 0 0 1 0 0 0 0 0 0
In [71]:
test_1.head(2)
Out[71]:
id area_code log_report_type volume transit_server_type outage_type_outage_type_1 outage_type_outage_type_2 outage_type_outage_type_4 outage_type_outage_type_5 broadband_type_broadband_type_1 broadband_type_broadband_type_10 broadband_type_broadband_type_2 broadband_type_broadband_type_3 broadband_type_broadband_type_4 broadband_type_broadband_type_5 broadband_type_broadband_type_6 broadband_type_broadband_type_7 broadband_type_broadband_type_8 broadband_type_broadband_type_9
0 3340 area_344 log_report_type_312 2 transit_server_type_35 0 0 1 0 0 0 1 0 0 0 0 0 0 0
1 14067 area_933 log_report_type_134 1 transit_server_type_20 1 0 0 0 0 0 1 0 0 0 0 0 0 0
In [72]:
cat_attr = list(train_1.select_dtypes("category").columns)
num_attr = list(train_1.columns.difference(cat_attr))
In [73]:
num_attr
Out[73]:
['broadband_type_broadband_type_1',
 'broadband_type_broadband_type_10',
 'broadband_type_broadband_type_2',
 'broadband_type_broadband_type_3',
 'broadband_type_broadband_type_4',
 'broadband_type_broadband_type_6',
 'broadband_type_broadband_type_7',
 'broadband_type_broadband_type_8',
 'broadband_type_broadband_type_9',
 'id',
 'outage_duration',
 'outage_type_outage_type_1',
 'outage_type_outage_type_2',
 'outage_type_outage_type_3',
 'outage_type_outage_type_4',
 'outage_type_outage_type_5',
 'volume']
In [74]:
train_1.drop(["outage_duration"], axis = 1, inplace = True)
In [75]:
train_1.shape, test_1.shape
Out[75]:
((5904, 19), (1477, 19))
In [76]:
corrmat = train_1.corr()

plt.subplots(figsize=(20, 10))
sns.heatmap(corrmat, annot=True, cmap = 'summer_r')
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x12bcd4490>

From the correlation plot, we observe that broadband_type_8 is weakly correlated with broadband_type_2, while broadband_type_9 is strongly correlated with outage_type_5.

Q3: Any other recommendations to improve the detection of outage durations:

1) Map the outage types that correlate with high outage durations against the id column to detect outage durations, then prioritize fixes for those specific outage types.

2) Map the transit server types with the highest outage durations against the id column to identify which kinds of servers face long outages, and implement corrective actions for them.

3) Using GridWatch, sudden drops in WiFi signal, ambient light levels, and users' device-charging patterns across the city can be monitored to detect outages.

(Research article reference : https://noahklugman.com/papers/deployment_smartgridcomm.pdf)

Reading & Loading the datasets

B) For Model Building and Feature Engineering

To run through Kaggle cloud

train_data = pd.read_csv("../input/train_data.csv")

broadband = pd.read_csv("../input/broadband_data.csv")

outage = pd.read_csv("../input/outage_data.csv")

report = pd.read_csv("../input/report_data.csv")

server = pd.read_csv("../input/server_data.csv")

test_data = pd.read_csv("../input/test_data-1593798292529.csv")

In [77]:
train_data = pd.read_csv("train_data.csv")
broadband = pd.read_csv("broadband_data.csv")
outage = pd.read_csv("outage_data.csv")
report = pd.read_csv("report_data.csv")
server = pd.read_csv("server_data.csv")
sample = pd.read_csv("sample_submission.csv")
test_data = pd.read_csv("test_data-1593798292529.csv")
In [78]:
train_data.head(2)
Out[78]:
id area_code outage_duration
0 13366 area_415 1
1 6783 area_474 0
In [79]:
test_data.head(2)
Out[79]:
id area_code
0 3340 area_344
1 14067 area_933
In [80]:
print(train_data.shape)
print(broadband.shape)
print(outage.shape)
print(report.shape)
print(server.shape)
print(sample.shape)
print(test_data.shape)
(5904, 3)
(21076, 2)
(18552, 2)
(58671, 3)
(31170, 2)
(1477, 2)
(1477, 2)

Checking the unique values of the 'id' column in all the datasets to identify duplicates, which determines how the datasets should be merged.

In [81]:
print(len(train_data.id.unique()))
print(len(broadband.id.unique()))
print(len(outage.id.unique()))
print(len(report.id.unique()))
print(len(server.id.unique()))
print(len(test_data.id.unique()))
5904
18552
18552
18552
18552
1477

We see that the 'id' column contains duplicate values in the broadband, report, and server datasets, since their row counts exceed their unique id counts.

Merging the datasets

Instead of dropping duplicate id values, we merge the files using pandas' pivot_table function so that no records are lost.
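As a minimal sketch with made-up values: an id can carry several broadband types, so `drop_duplicates` would discard information, whereas pivoting the duplicate-id rows into count columns keeps every record while producing one row per id, ready for a left merge:

```python
import pandas as pd

# toy 'broadband'-like table: id 1 has two broadband types
broadband_toy = pd.DataFrame({
    'id': [1, 1, 2],
    'broadband_type': ['broadband_type_6', 'broadband_type_8', 'broadband_type_2'],
    'origin': ['train', 'train', 'test'],
})

# one row per id, one count column per broadband type; missing pairs become 0
merged = broadband_toy.pivot_table(values='origin', index='id',
                                   columns='broadband_type',
                                   aggfunc=len, fill_value=0)
print(merged.shape)  # (2, 3): both of id 1's types survive as separate columns
```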

In [82]:
train_data['origin'] = 'train'
test_data['origin'] = 'test'
data_temp = pd.concat([train_data,test_data], ignore_index=True)
print(data_temp.head())
print(data_temp.tail())
  area_code     id origin  outage_duration
0  area_415  13366  train              1.0
1  area_474   6783  train              0.0
2  area_931   9519  train              1.0
3  area_700  10202  train              1.0
4  area_600   4555  train              2.0
      area_code     id origin  outage_duration
7376   area_124   7975   test              NaN
7377   area_821  10664   test              NaN
7378   area_476   7753   test              NaN
7379    area_38    989   test              NaN
7380  area_1067   3129   test              NaN
In [83]:
print(train_data.shape)
print(test_data.shape)
print(data_temp.shape)
print(broadband.shape)
print(outage.shape)
print(report.shape)
print(server.shape)
(5904, 4)
(1477, 3)
(7381, 4)
(21076, 2)
(18552, 2)
(58671, 3)
(31170, 2)
In [84]:
print(outage.head(2))
print(report.head(2))
print(server.head(2))
     id    outage_type
0  6597  outage_type_2
1  8011  outage_type_2
     id     log_report_type  volume
0  6597  log_report_type_68       6
1  8011  log_report_type_68       7
     id     transit_server_type
0  6597  transit_server_type_11
1  8011  transit_server_type_15
In [85]:
data_columns=["id","area_code","origin","outage_duration"]
data_columns
Out[85]:
['id', 'area_code', 'origin', 'outage_duration']
In [86]:
data=data_temp[data_columns]
In [87]:
data.head(2)
Out[87]:
id area_code origin outage_duration
0 13366 area_415 train 1.0
1 6783 area_474 train 0.0
In [88]:
data.tail(2)
Out[88]:
id area_code origin outage_duration
7379 989 area_38 test NaN
7380 3129 area_1067 test NaN

1) Broadband

In [89]:
print(broadband.head(2))
     id    broadband_type
0  6597  broadband_type_8
1  8011  broadband_type_8
In [90]:
broadband = broadband.merge(data, on='id', how='left')
broadband.head(2)
Out[90]:
id broadband_type area_code origin outage_duration
0 6597 broadband_type_8 NaN NaN NaN
1 8011 broadband_type_8 area_1 train 0.0
In [91]:
broadband_unq = pd.DataFrame(broadband['broadband_type'].value_counts())
broadband_unq.head()
Out[91]:
broadband_type
broadband_type_8 10268
broadband_type_2 8918
broadband_type_6 582
broadband_type_7 498
broadband_type_4 330

Determine % of training samples:

In [92]:
broadband_unq['PercTrain'] = broadband.pivot_table(values='origin',index='broadband_type',aggfunc=lambda x: sum(x=='train')/float(len(x)))
broadband_unq.head(10)
Out[92]:
broadband_type PercTrain
broadband_type_8 10268 0.316128
broadband_type_2 8918 0.319691
broadband_type_6 582 0.345361
broadband_type_7 498 0.359438
broadband_type_4 330 0.345455
broadband_type_9 190 0.321053
broadband_type_3 145 0.303448
broadband_type_10 73 0.383562
broadband_type_1 58 0.465517
broadband_type_5 14 0.214286

Determine the mode of each:

In [93]:
from scipy.stats import mode  # mode() comes from scipy.stats; not imported earlier
broadband_unq['Mode_outage_duration'] = broadband.loc[broadband['origin']=='train'].pivot_table(values='outage_duration',index='broadband_type', aggfunc=lambda x: mode(x).mode[0])
In [94]:
# 'broadband_2' does not match any value ('broadband_type_2' would), so this returns an empty frame
broadband.loc[broadband['broadband_type']=='broadband_2']
Out[94]:
id broadband_type area_code origin outage_duration
In [95]:
broadband_merge = broadband.pivot_table(values='origin',index='id',columns='broadband_type',aggfunc=lambda x: len(x),fill_value=0)
In [96]:
broadband_merge.head(2)
Out[96]:
broadband_type broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 broadband_type_6 broadband_type_7 broadband_type_8 broadband_type_9
id
1 0 0 0 0 0 0 1 0 1 0
2 0 0 1 0 0 0 0 0 0 0
In [97]:
broadband_merge.shape
Out[97]:
(18552, 10)
In [98]:
broadband_data = data.merge(broadband_merge, left_on='id', right_index=True)
broadband_data.head(2)
Out[98]:
id area_code origin outage_duration broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 broadband_type_6 broadband_type_7 broadband_type_8 broadband_type_9
0 13366 area_415 train 1.0 0 0 1 0 0 0 0 0 0 0
1 6783 area_474 train 0.0 0 0 1 0 0 0 0 0 0 0
In [99]:
print(broadband_data.shape)
(7381, 14)
In [100]:
broadband_temp=broadband_data.copy()

2) outage

In [101]:
outage['outage_type'].value_counts().head()
Out[101]:
outage_type_2    8737
outage_type_1    8728
outage_type_4    1014
outage_type_5      65
outage_type_3       8
Name: outage_type, dtype: int64
In [102]:
outage = outage.merge(data, on='id', how='left')
outage.head(2)
Out[102]:
id outage_type area_code origin outage_duration
0 6597 outage_type_2 NaN NaN NaN
1 8011 outage_type_2 area_1 train 0.0
In [103]:
outage_unq = pd.DataFrame(outage['outage_type'].value_counts())
outage_unq.head()
Out[103]:
outage_type
outage_type_2 8737
outage_type_1 8728
outage_type_4 1014
outage_type_5 65
outage_type_3 8
In [104]:
outage_unq['PercTrain'] = outage.pivot_table(values='origin',index='outage_type',aggfunc=lambda x: sum(x=='train')/float(len(x)))
outage_unq.head()
Out[104]:
outage_type PercTrain
outage_type_2 8737 0.329060
outage_type_1 8728 0.309349
outage_type_4 1014 0.301775
outage_type_5 65 0.292308
outage_type_3 8 0.500000

Determine the mode of each:

In [105]:
outage_unq['Mode_outage_duration'] = outage.loc[outage['origin']=='train'].pivot_table(values='outage_duration',index='outage_type', aggfunc=lambda x: mode(x).mode[0])

Merge preprocess into original and then into train:

In [106]:
outage_merge = outage.pivot_table(values='origin',index='id',columns='outage_type', aggfunc=lambda x:len(x), fill_value=0)
In [107]:
outage_merge.head(2)
Out[107]:
outage_type outage_type_1 outage_type_2 outage_type_3 outage_type_4 outage_type_5
id
1 1 0 0 0 0
2 0 1 0 0 0
In [108]:
outage_merge.shape
Out[108]:
(18552, 5)
In [109]:
broadband_outage = broadband_data.merge(outage_merge, left_on='id', right_index=True)
broadband_outage.head(2)
Out[109]:
id area_code origin outage_duration broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 broadband_type_6 broadband_type_7 broadband_type_8 broadband_type_9 outage_type_1 outage_type_2 outage_type_3 outage_type_4 outage_type_5
0 13366 area_415 train 1.0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0
1 6783 area_474 train 0.0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0
In [110]:
broadband_outage.shape
Out[110]:
(7381, 19)

3) Report

In [111]:
report['log_report_type'].value_counts()
Out[111]:
log_report_type_312    5267
log_report_type_232    4754
log_report_type_82     3472
log_report_type_203    2823
log_report_type_313    2145
                       ... 
log_report_type_385       1
log_report_type_208       1
log_report_type_32        1
log_report_type_210       1
log_report_type_341       1
Name: log_report_type, Length: 386, dtype: int64
In [112]:
report = report.merge(data,on='id',how='left')
report.head(2)
Out[112]:
id log_report_type volume area_code origin outage_duration
0 6597 log_report_type_68 6 NaN NaN NaN
1 8011 log_report_type_68 7 area_1 train 0.0
In [113]:
report_unq = pd.DataFrame(report['log_report_type'].value_counts())
report_unq.head()
Out[113]:
log_report_type
log_report_type_312 5267
log_report_type_232 4754
log_report_type_82 3472
log_report_type_203 2823
log_report_type_313 2145

Determine % of training samples:

In [114]:
report_unq['PercTrain'] = report.pivot_table(values='origin',index='log_report_type',aggfunc=lambda x: sum(x=='train')/float(len(x)))
report_unq.head(10)
Out[114]:
log_report_type PercTrain
log_report_type_312 5267 0.330169
log_report_type_232 4754 0.327303
log_report_type_82 3472 0.313364
log_report_type_203 2823 0.309600
log_report_type_313 2145 0.337529
log_report_type_233 1901 0.336665
log_report_type_307 1597 0.333125
log_report_type_54 1573 0.315957
log_report_type_170 1526 0.317169
log_report_type_71 1514 0.321664

Determine the mode of each:

In [115]:
report_unq['Mode_outage_duration'] = report.loc[report['origin']=='train'].pivot_table(values='outage_duration',index='log_report_type', aggfunc=lambda x: mode(x).mode[0])
In [116]:
len(report_unq)
Out[116]:
386

Initializing:

In [117]:
report_unq['preprocess'] = report_unq.index.values

Removing the report types that appear only in the train set (PercTrain == 1):

In [118]:
report_unq['preprocess'].loc[report_unq['PercTrain']==1] = np.nan
In [119]:
report_unq[-100:]
Out[119]:
log_report_type PercTrain Mode_outage_duration preprocess
log_report_type_242 2 0.5 0.0 log_report_type_242
log_report_type_259 2 0.5 0.0 log_report_type_259
log_report_type_147 2 0.5 2.0 log_report_type_147
log_report_type_92 2 0.0 NaN log_report_type_92
log_report_type_93 2 0.0 NaN log_report_type_93
... ... ... ... ...
log_report_type_385 1 0.0 NaN log_report_type_385
log_report_type_208 1 1.0 1.0 NaN
log_report_type_32 1 0.0 NaN log_report_type_32
log_report_type_210 1 1.0 2.0 NaN
log_report_type_341 1 1.0 0.0 NaN

100 rows × 4 columns

In [120]:
report_unq[:3]
Out[120]:
log_report_type PercTrain Mode_outage_duration preprocess
log_report_type_312 5267 0.330169 0.0 log_report_type_312
log_report_type_232 4754 0.327303 0.0 log_report_type_232
log_report_type_82 3472 0.313364 2.0 log_report_type_82
In [121]:
limit=128
report_unq['preprocess'].iloc[limit:]=report_unq['Mode_outage_duration'].iloc[limit:].apply(lambda x: 'Remove' if pd.isnull(x) else 'log_report others_%d'%int(x))
print (report_unq['preprocess'].value_counts())
report_unq
log_report others_0    109
Remove                  76
log_report others_1     48
log_report others_2     25
log_report_type_229      1
                      ... 
log_report_type_101      1
log_report_type_291      1
log_report_type_368      1
log_report_type_155      1
log_report_type_38       1
Name: preprocess, Length: 132, dtype: int64
Out[121]:
log_report_type PercTrain Mode_outage_duration preprocess
log_report_type_312 5267 0.330169 0.0 log_report_type_312
log_report_type_232 4754 0.327303 0.0 log_report_type_232
log_report_type_82 3472 0.313364 2.0 log_report_type_82
log_report_type_203 2823 0.309600 2.0 log_report_type_203
log_report_type_313 2145 0.337529 0.0 log_report_type_313
... ... ... ... ...
log_report_type_385 1 0.000000 NaN Remove
log_report_type_208 1 1.000000 1.0 log_report others_1
log_report_type_32 1 0.000000 NaN Remove
log_report_type_210 1 1.000000 2.0 log_report others_2
log_report_type_341 1 1.000000 0.0 log_report others_0

386 rows × 4 columns

Merge preprocess into original and then into train:

In [122]:
report_data = report.merge(report_unq[['preprocess']], left_on='log_report_type',right_index=True)
print(report_data.head())
report_data['preprocess'].value_counts()
       id     log_report_type  volume area_code origin  outage_duration  \
0    6597  log_report_type_68       6       NaN    NaN              NaN   
1    8011  log_report_type_68       7    area_1  train              0.0   
2    2597  log_report_type_68       1       NaN    NaN              NaN   
23   6914  log_report_type_68      11    area_1  train              0.0   
41  16416  log_report_type_68       4       NaN    NaN              NaN   

            preprocess  
0   log_report_type_68  
1   log_report_type_68  
2   log_report_type_68  
23  log_report_type_68  
41  log_report_type_68  
Out[122]:
log_report_type_312    5267
log_report_type_232    4754
log_report_type_82     3472
log_report_type_203    2823
log_report_type_313    2145
                       ... 
log_report_type_217      28
log_report_type_118      28
log_report_type_8        26
log_report_type_105      26
log_report_type_20       25
Name: preprocess, Length: 132, dtype: int64
In [123]:
report_merge = report_data.pivot_table(values='volume',index='id',columns='preprocess',aggfunc=lambda x: len(x),fill_value=0)
In [124]:
broadband_outage_report = broadband_outage.merge(report_merge, left_on='id', right_index=True)
broadband_outage_report.head(2)
Out[124]:
id area_code origin outage_duration broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 ... log_report_type_8 log_report_type_80 log_report_type_81 log_report_type_82 log_report_type_83 log_report_type_85 log_report_type_86 log_report_type_87 log_report_type_94 log_report_type_95
0 13366 area_415 train 1.0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 6783 area_474 train 0.0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2 rows × 151 columns

In [125]:
broadband_outage_report.shape
Out[125]:
(7381, 151)

4) Server

In [126]:
server['transit_server_type'].value_counts().head()
Out[126]:
transit_server_type_11    7888
transit_server_type_35    6615
transit_server_type_34    5927
transit_server_type_15    4395
transit_server_type_20    1458
Name: transit_server_type, dtype: int64
In [127]:
server = server.merge(data,on='id', how='left')
server.head(2)
Out[127]:
id transit_server_type area_code origin outage_duration
0 6597 transit_server_type_11 NaN NaN NaN
1 8011 transit_server_type_15 area_1 train 0.0
In [128]:
server_unq = pd.DataFrame(server['transit_server_type'].value_counts())
server_unq.head()
Out[128]:
transit_server_type
transit_server_type_11 7888
transit_server_type_35 6615
transit_server_type_34 5927
transit_server_type_15 4395
transit_server_type_20 1458

Determine % of training samples:

In [129]:
server_unq['PercTrain'] = server.pivot_table(values='origin',index='transit_server_type',aggfunc=lambda x: sum(x=='train')/float(len(x)))
server_unq.head()
Out[129]:
transit_server_type PercTrain
transit_server_type_11 7888 0.313514
transit_server_type_35 6615 0.326228
transit_server_type_34 5927 0.324785
transit_server_type_15 4395 0.311718
transit_server_type_20 1458 0.299726

Determine the mode of each:

In [130]:
server_unq['Mode_outage_duration'] = server.loc[server['origin']=='train'].pivot_table(values='outage_duration',index='transit_server_type', aggfunc=lambda x: mode(x).mode[0])
In [131]:
server_unq.iloc[-20:]
Out[131]:
transit_server_type PercTrain Mode_outage_duration
transit_server_type_3 19 0.315789 2.0
transit_server_type_38 19 0.421053 0.0
transit_server_type_39 18 0.277778 1.0
transit_server_type_36 18 0.388889 0.0
transit_server_type_49 17 0.294118 2.0
transit_server_type_53 17 0.235294 2.0
transit_server_type_9 14 0.214286 0.0
transit_server_type_19 14 0.214286 2.0
transit_server_type_31 10 0.300000 0.0
transit_server_type_37 10 0.400000 0.0
transit_server_type_12 6 0.333333 0.0
transit_server_type_25 5 0.200000 0.0
transit_server_type_51 4 0.750000 0.0
transit_server_type_1 4 0.250000 0.0
transit_server_type_17 3 0.000000 NaN
transit_server_type_48 2 0.500000 0.0
transit_server_type_41 2 0.500000 0.0
transit_server_type_4 1 0.000000 NaN
transit_server_type_52 1 0.000000 NaN
transit_server_type_33 1 0.000000 NaN
In [132]:
server_unq.nunique()
Out[132]:
transit_server_type     44
PercTrain               42
Mode_outage_duration     3
dtype: int64

Initializing:

In [133]:
server_unq['preprocess'] = server_unq.index.values

Removing the server types that never appear in the train set (PercTrain == 0):

In [134]:
server_unq['preprocess'].loc[server_unq['PercTrain']==0] = 'Remove'

Replacing the rarer types with mode-based 'others' groups:

In [135]:
top_unchange = 33
server_unq['preprocess'].iloc[top_unchange:] = server_unq['Mode_outage_duration'].iloc[top_unchange:].apply(lambda x: 'Remove' if pd.isnull(x) else 'transit_server others_%d'%int(x))
In [136]:
server_unq['preprocess'].iloc[33:]
Out[136]:
transit_server_type_3     transit_server others_2
transit_server_type_38    transit_server others_0
transit_server_type_39    transit_server others_1
transit_server_type_36    transit_server others_0
transit_server_type_49    transit_server others_2
transit_server_type_53    transit_server others_2
transit_server_type_9     transit_server others_0
transit_server_type_19    transit_server others_2
transit_server_type_31    transit_server others_0
transit_server_type_37    transit_server others_0
transit_server_type_12    transit_server others_0
transit_server_type_25    transit_server others_0
transit_server_type_51    transit_server others_0
transit_server_type_1     transit_server others_0
transit_server_type_17                     Remove
transit_server_type_48    transit_server others_0
transit_server_type_41    transit_server others_0
transit_server_type_4                      Remove
transit_server_type_52                     Remove
transit_server_type_33                     Remove
Name: preprocess, dtype: object
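The rare-category collapse above can be sketched in isolation (toy counts; the `top_unchange` cut-off and category names are assumptions). Categories below the frequency cut-off are rebucketed by the train-set mode of their outage_duration, and categories with no train-set mode are marked for removal:

```python
import numpy as np
import pandas as pd

# toy frequency table: one row per category, sorted by count,
# with the train-set mode of outage_duration per category
unq = pd.DataFrame({
    'count': [500, 300, 4, 3, 2],
    'Mode_outage_duration': [0.0, 2.0, 1.0, np.nan, 0.0],
}, index=['type_a', 'type_b', 'type_c', 'type_d', 'type_e'])

top_unchange = 2  # keep the 2 most frequent categories as-is (assumed cut-off)
unq['preprocess'] = unq.index.values
rare = unq.index[top_unchange:]
# rare categories collapse into an 'others_<mode>' bucket; NaN modes are dropped
unq.loc[rare, 'preprocess'] = unq.loc[rare, 'Mode_outage_duration'].apply(
    lambda x: 'Remove' if pd.isnull(x) else 'others_%d' % int(x))
print(unq['preprocess'].tolist())
# ['type_a', 'type_b', 'others_1', 'Remove', 'others_0']
```

This shrinks the long tail of one-hot columns while still letting the rare categories contribute a coarse signal.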
In [137]:
server_data = server.merge(server_unq[['preprocess']], left_on='transit_server_type',right_index=True)
print(server_data.head(2))
server_data['preprocess'].value_counts()
     id     transit_server_type area_code origin  outage_duration  \
0  6597  transit_server_type_11       NaN    NaN              NaN   
4  5022  transit_server_type_11       NaN    NaN              NaN   

               preprocess  
0  transit_server_type_11  
4  transit_server_type_11  
Out[137]:
transit_server_type_11     7888
transit_server_type_35     6615
transit_server_type_34     5927
transit_server_type_15     4395
transit_server_type_20     1458
transit_server_type_54      684
transit_server_type_13      582
transit_server_type_42      478
transit_server_type_44      466
transit_server_type_23      429
transit_server_type_14      330
transit_server_type_43      306
transit_server_type_22      223
transit_server_type_50      154
transit_server_type_10      145
transit_server_type_21      136
transit_server others_0      94
transit_server_type_18       73
transit_server_type_47       69
transit_server others_2      67
transit_server_type_26       65
transit_server_type_32       63
transit_server_type_30       60
transit_server_type_45       53
transit_server_type_24       46
transit_server_type_27       44
transit_server_type_29       42
transit_server_type_40       40
transit_server_type_46       38
transit_server_type_2        37
transit_server_type_28       32
transit_server_type_8        29
transit_server_type_6        28
transit_server_type_5        26
transit_server_type_7        24
transit_server others_1      18
Remove                        6
Name: preprocess, dtype: int64
In [138]:
server_merge = server_data.pivot_table(values='origin',index='id',columns='preprocess',aggfunc=lambda x: len(x),fill_value=0)
In [139]:
final_data = broadband_outage_report.merge(server_merge, left_on='id', right_index=True)
final_data.head(2)
Out[139]:
id area_code origin outage_duration broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 ... transit_server_type_44 transit_server_type_45 transit_server_type_46 transit_server_type_47 transit_server_type_5 transit_server_type_50 transit_server_type_54 transit_server_type_6 transit_server_type_7 transit_server_type_8
0 13366 area_415 train 1.0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 6783 area_474 train 0.0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2 rows × 188 columns

In [140]:
final_data.shape
Out[140]:
(7381, 188)

Removing the extra columns added for merging, and splitting back into train and test sets

In [141]:
[x for x in data.columns if 'Remove' in x]
Out[141]:
[]
In [142]:
# the 'Remove' columns from report_merge and server_merge were suffixed _x/_y during the merges
final_data.drop(['Remove_x','Remove_y'],axis=1,inplace=True)
In [143]:
train = final_data.loc[final_data['origin']=='train']
test = final_data.loc[final_data['origin']=='test']
In [144]:
train.drop('origin',axis=1,inplace=True)
test.drop(['origin','outage_duration'],axis=1,inplace=True)
In [145]:
train.dtypes
Out[145]:
id                          int64
area_code                  object
outage_duration           float64
broadband_type_1            int64
broadband_type_10           int64
                           ...   
transit_server_type_50      int64
transit_server_type_54      int64
transit_server_type_6       int64
transit_server_type_7       int64
transit_server_type_8       int64
Length: 185, dtype: object
In [146]:
train.head()
Out[146]:
id area_code outage_duration broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 broadband_type_6 ... transit_server_type_44 transit_server_type_45 transit_server_type_46 transit_server_type_47 transit_server_type_5 transit_server_type_50 transit_server_type_54 transit_server_type_6 transit_server_type_7 transit_server_type_8
0 13366 area_415 1.0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 6783 area_474 0.0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 9519 area_931 1.0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 10202 area_700 1.0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 4555 area_600 2.0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 185 columns

In [147]:
train.tail()
Out[147]:
id area_code outage_duration broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 broadband_type_6 ... transit_server_type_44 transit_server_type_45 transit_server_type_46 transit_server_type_47 transit_server_type_5 transit_server_type_50 transit_server_type_54 transit_server_type_6 transit_server_type_7 transit_server_type_8
5899 1910 area_403 1.0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5900 10475 area_821 0.0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5901 10675 area_798 2.0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5902 14714 area_210 0.0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5903 6253 area_1007 0.0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 185 columns

In [148]:
train.shape, test.shape
Out[148]:
((5904, 185), (1477, 184))

Understanding the data

In [149]:
train.dtypes
Out[149]:
id                          int64
area_code                  object
outage_duration           float64
broadband_type_1            int64
broadband_type_10           int64
                           ...   
transit_server_type_50      int64
transit_server_type_54      int64
transit_server_type_6       int64
transit_server_type_7       int64
transit_server_type_8       int64
Length: 185, dtype: object
In [150]:
train.describe()
Out[150]:
id outage_duration broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 broadband_type_6 broadband_type_7 ... transit_server_type_44 transit_server_type_45 transit_server_type_46 transit_server_type_47 transit_server_type_5 transit_server_type_50 transit_server_type_54 transit_server_type_6 transit_server_type_7 transit_server_type_8
count 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000 ... 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000 5904.000000
mean 9323.841972 0.450203 0.004573 0.004743 0.482893 0.007453 0.019309 0.000508 0.034045 0.030318 ... 0.024051 0.003049 0.002710 0.004065 0.002202 0.006267 0.034383 0.002033 0.002202 0.001186
std 5338.654942 0.666642 0.067476 0.068708 0.499750 0.086013 0.137620 0.022538 0.181359 0.171477 ... 0.153222 0.055136 0.051992 0.063633 0.046877 0.078922 0.182228 0.045041 0.046877 0.034416
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 4763.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 9330.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 13932.750000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 18550.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 184 columns

In [151]:
train.describe(include='object')
Out[151]:
area_code
count 5904
unique 876
top area_821
freq 67
In [152]:
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5904 entries, 0 to 5903
Columns: 185 entries, id to transit_server_type_8
dtypes: float64(1), int64(183), object(1)
memory usage: 8.4+ MB

Checking for missing values

In [153]:
train.isnull().sum()
Out[153]:
id                        0
area_code                 0
outage_duration           0
broadband_type_1          0
broadband_type_10         0
                         ..
transit_server_type_50    0
transit_server_type_54    0
transit_server_type_6     0
transit_server_type_7     0
transit_server_type_8     0
Length: 185, dtype: int64

Typecasting or conversion of object type features to categorical variables

In [154]:
for col in ['area_code']:
    train[col]=train[col].astype('category')
In [155]:
for col in ['area_code']:
    test[col] = test[col].astype('category')
In [156]:
cat_attr = list(train.select_dtypes("category").columns)
num_attr = list(train.columns.difference(cat_attr))
In [157]:
num_attr
Out[157]:
['broadband_type_1',
 'broadband_type_10',
 'broadband_type_2',
 'broadband_type_3',
 'broadband_type_4',
 'broadband_type_5',
 'broadband_type_6',
 'broadband_type_7',
 'broadband_type_8',
 'broadband_type_9',
 'id',
 'log_report others_0',
 'log_report others_1',
 'log_report others_2',
 'log_report_type_101',
 'log_report_type_103',
 'log_report_type_105',
 'log_report_type_109',
 'log_report_type_118',
 'log_report_type_132',
 'log_report_type_133',
 'log_report_type_134',
 'log_report_type_135',
 'log_report_type_153',
 'log_report_type_154',
 'log_report_type_155',
 'log_report_type_157',
 'log_report_type_160',
 'log_report_type_161',
 'log_report_type_163',
 'log_report_type_167',
 'log_report_type_170',
 'log_report_type_171',
 'log_report_type_172',
 'log_report_type_179',
 'log_report_type_181',
 'log_report_type_182',
 'log_report_type_188',
 'log_report_type_191',
 'log_report_type_193',
 'log_report_type_195',
 'log_report_type_196',
 'log_report_type_197',
 'log_report_type_198',
 'log_report_type_20',
 'log_report_type_201',
 'log_report_type_202',
 'log_report_type_203',
 'log_report_type_204',
 'log_report_type_205',
 'log_report_type_206',
 'log_report_type_207',
 'log_report_type_209',
 'log_report_type_211',
 'log_report_type_212',
 'log_report_type_217',
 'log_report_type_218',
 'log_report_type_219',
 'log_report_type_220',
 'log_report_type_221',
 'log_report_type_222',
 'log_report_type_223',
 'log_report_type_227',
 'log_report_type_228',
 'log_report_type_229',
 'log_report_type_230',
 'log_report_type_231',
 'log_report_type_232',
 'log_report_type_233',
 'log_report_type_234',
 'log_report_type_235',
 'log_report_type_236',
 'log_report_type_239',
 'log_report_type_240',
 'log_report_type_273',
 'log_report_type_276',
 'log_report_type_280',
 'log_report_type_283',
 'log_report_type_284',
 'log_report_type_285',
 'log_report_type_289',
 'log_report_type_290',
 'log_report_type_291',
 'log_report_type_293',
 'log_report_type_301',
 'log_report_type_304',
 'log_report_type_305',
 'log_report_type_306',
 'log_report_type_307',
 'log_report_type_308',
 'log_report_type_309',
 'log_report_type_310',
 'log_report_type_311',
 'log_report_type_312',
 'log_report_type_313',
 'log_report_type_314',
 'log_report_type_315',
 'log_report_type_316',
 'log_report_type_318',
 'log_report_type_345',
 'log_report_type_349',
 'log_report_type_35',
 'log_report_type_353',
 'log_report_type_354',
 'log_report_type_360',
 'log_report_type_362',
 'log_report_type_368',
 'log_report_type_370',
 'log_report_type_374',
 'log_report_type_375',
 'log_report_type_376',
 'log_report_type_378',
 'log_report_type_38',
 'log_report_type_39',
 'log_report_type_42',
 'log_report_type_44',
 'log_report_type_47',
 'log_report_type_51',
 'log_report_type_52',
 'log_report_type_54',
 'log_report_type_55',
 'log_report_type_56',
 'log_report_type_62',
 'log_report_type_63',
 'log_report_type_66',
 'log_report_type_68',
 'log_report_type_70',
 'log_report_type_71',
 'log_report_type_73',
 'log_report_type_74',
 'log_report_type_75',
 'log_report_type_76',
 'log_report_type_8',
 'log_report_type_80',
 'log_report_type_81',
 'log_report_type_82',
 'log_report_type_83',
 'log_report_type_85',
 'log_report_type_86',
 'log_report_type_87',
 'log_report_type_94',
 'log_report_type_95',
 'outage_duration',
 'outage_type_1',
 'outage_type_2',
 'outage_type_3',
 'outage_type_4',
 'outage_type_5',
 'transit_server others_0',
 'transit_server others_1',
 'transit_server others_2',
 'transit_server_type_10',
 'transit_server_type_11',
 'transit_server_type_13',
 'transit_server_type_14',
 'transit_server_type_15',
 'transit_server_type_18',
 'transit_server_type_2',
 'transit_server_type_20',
 'transit_server_type_21',
 'transit_server_type_22',
 'transit_server_type_23',
 'transit_server_type_24',
 'transit_server_type_26',
 'transit_server_type_27',
 'transit_server_type_28',
 'transit_server_type_29',
 'transit_server_type_30',
 'transit_server_type_32',
 'transit_server_type_34',
 'transit_server_type_35',
 'transit_server_type_40',
 'transit_server_type_42',
 'transit_server_type_43',
 'transit_server_type_44',
 'transit_server_type_45',
 'transit_server_type_46',
 'transit_server_type_47',
 'transit_server_type_5',
 'transit_server_type_50',
 'transit_server_type_54',
 'transit_server_type_6',
 'transit_server_type_7',
 'transit_server_type_8']
In [158]:
cat_attr
Out[158]:
['area_code']
In [159]:
for col in ['outage_duration']:
    train[col] = train[col].astype(int)
In [160]:
train.dtypes
Out[160]:
id                           int64
area_code                 category
outage_duration              int64
broadband_type_1             int64
broadband_type_10            int64
                            ...   
transit_server_type_50       int64
transit_server_type_54       int64
transit_server_type_6        int64
transit_server_type_7        int64
transit_server_type_8        int64
Length: 185, dtype: object
In [161]:
train.isnull().sum()
Out[161]:
id                        0
area_code                 0
outage_duration           0
broadband_type_1          0
broadband_type_10         0
                         ..
transit_server_type_50    0
transit_server_type_54    0
transit_server_type_6     0
transit_server_type_7     0
transit_server_type_8     0
Length: 185, dtype: int64

Feature Engineering

Understanding the area (location) pattern and how frequently each area occurs

In [162]:
loc_counter = {}
def loc_count(x):
    # Running count of how many times this area_code has been seen so far.
    if x['area_code'] in loc_counter:
        loc_counter[x['area_code']] += 1
    else:
        loc_counter[x['area_code']] = 1
    return loc_counter[x['area_code']]
# The counter dict is shared between the two passes, so the counts in the
# test set continue from where the train totals left off.
train['location_occurence_count'] = train[['id','area_code']].apply(loc_count, axis=1)
test['location_occurence_count'] = test[['id','area_code']].apply(loc_count, axis=1)
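As an aside, the same running per-area count can be computed without a mutable dict by using `groupby().cumcount()`. A minimal sketch on a hypothetical toy frame (note this variant resets per DataFrame, unlike the shared counter above):

```python
import pandas as pd

# Toy frame standing in for the train data (hypothetical values).
df = pd.DataFrame({"id": [1, 2, 3, 4],
                   "area_code": ["area_126", "area_1", "area_126", "area_126"]})

# 1-based running count of how often each area_code has appeared so far.
df["location_occurence_count"] = df.groupby("area_code").cumcount() + 1
print(df["location_occurence_count"].tolist())  # [1, 1, 2, 3]
```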
In [163]:
locbool = train['area_code']=='area_126'
train.loc[locbool,['id','area_code','location_occurence_count']].head(5)
Out[163]:
id area_code location_occurence_count
95 199 area_126 1
230 2987 area_126 2
305 831 area_126 3
555 12043 area_126 4
613 7697 area_126 5
In [164]:
locbool = test['area_code']=='area_126'
test.loc[locbool,['id','area_code','location_occurence_count']].head()
Out[164]:
id area_code location_occurence_count
5952 4905 area_126 60
5995 2835 area_126 61
6417 13082 area_126 62
6555 17207 area_126 63
6565 16368 area_126 64
In [165]:
train.loc[:,['id','area_code','location_occurence_count']].head(100)
Out[165]:
id area_code location_occurence_count
0 13366 area_415 1
1 6783 area_474 1
2 9519 area_931 1
3 10202 area_700 1
4 4555 area_600 1
... ... ... ...
95 199 area_126 1
96 18395 area_846 3
97 14106 area_852 1
98 8950 area_1033 1
99 2707 area_793 3

100 rows × 3 columns

In [166]:
test.loc[:,['id','area_code','location_occurence_count']].head(100)
Out[166]:
id area_code location_occurence_count
5904 3340 area_344 12
5905 14067 area_933 1
5906 1134 area_16 10
5907 27 area_793 43
5908 9793 area_344 13
... ... ... ...
5999 2493 area_444 4
6000 12463 area_9 22
6001 11860 area_468 19
6002 12364 area_468 20
6003 9497 area_242 45

100 rows × 3 columns

In [167]:
train.head()
Out[167]:
id area_code outage_duration broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 broadband_type_6 ... transit_server_type_45 transit_server_type_46 transit_server_type_47 transit_server_type_5 transit_server_type_50 transit_server_type_54 transit_server_type_6 transit_server_type_7 transit_server_type_8 location_occurence_count
0 13366 area_415 1 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 6783 area_474 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
2 9519 area_931 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 10202 area_700 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 4555 area_600 2 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 186 columns

In [168]:
test.head()
Out[168]:
id area_code broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 broadband_type_6 broadband_type_7 ... transit_server_type_45 transit_server_type_46 transit_server_type_47 transit_server_type_5 transit_server_type_50 transit_server_type_54 transit_server_type_6 transit_server_type_7 transit_server_type_8 location_occurence_count
5904 3340 area_344 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 12
5905 14067 area_933 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
5906 1134 area_16 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 10
5907 27 area_793 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 43
5908 9793 area_344 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 13

5 rows × 185 columns

Checking the categorical & numerical features

In [169]:
cat_attr = list(train.select_dtypes("category").columns)
num_attr = list(train.columns.difference(cat_attr))
In [170]:
cat_attr
Out[170]:
['area_code']
In [171]:
train.dtypes
Out[171]:
id                             int64
area_code                   category
outage_duration                int64
broadband_type_1               int64
broadband_type_10              int64
                              ...   
transit_server_type_54         int64
transit_server_type_6          int64
transit_server_type_7          int64
transit_server_type_8          int64
location_occurence_count       int64
Length: 186, dtype: object
In [172]:
num_attr
Out[172]:
['broadband_type_1',
 'broadband_type_10',
 'broadband_type_2',
 'broadband_type_3',
 'broadband_type_4',
 'broadband_type_5',
 'broadband_type_6',
 'broadband_type_7',
 'broadband_type_8',
 'broadband_type_9',
 'id',
 'location_occurence_count',
 'log_report others_0',
 'log_report others_1',
 'log_report others_2',
 'log_report_type_101',
 'log_report_type_103',
 'log_report_type_105',
 'log_report_type_109',
 'log_report_type_118',
 'log_report_type_132',
 'log_report_type_133',
 'log_report_type_134',
 'log_report_type_135',
 'log_report_type_153',
 'log_report_type_154',
 'log_report_type_155',
 'log_report_type_157',
 'log_report_type_160',
 'log_report_type_161',
 'log_report_type_163',
 'log_report_type_167',
 'log_report_type_170',
 'log_report_type_171',
 'log_report_type_172',
 'log_report_type_179',
 'log_report_type_181',
 'log_report_type_182',
 'log_report_type_188',
 'log_report_type_191',
 'log_report_type_193',
 'log_report_type_195',
 'log_report_type_196',
 'log_report_type_197',
 'log_report_type_198',
 'log_report_type_20',
 'log_report_type_201',
 'log_report_type_202',
 'log_report_type_203',
 'log_report_type_204',
 'log_report_type_205',
 'log_report_type_206',
 'log_report_type_207',
 'log_report_type_209',
 'log_report_type_211',
 'log_report_type_212',
 'log_report_type_217',
 'log_report_type_218',
 'log_report_type_219',
 'log_report_type_220',
 'log_report_type_221',
 'log_report_type_222',
 'log_report_type_223',
 'log_report_type_227',
 'log_report_type_228',
 'log_report_type_229',
 'log_report_type_230',
 'log_report_type_231',
 'log_report_type_232',
 'log_report_type_233',
 'log_report_type_234',
 'log_report_type_235',
 'log_report_type_236',
 'log_report_type_239',
 'log_report_type_240',
 'log_report_type_273',
 'log_report_type_276',
 'log_report_type_280',
 'log_report_type_283',
 'log_report_type_284',
 'log_report_type_285',
 'log_report_type_289',
 'log_report_type_290',
 'log_report_type_291',
 'log_report_type_293',
 'log_report_type_301',
 'log_report_type_304',
 'log_report_type_305',
 'log_report_type_306',
 'log_report_type_307',
 'log_report_type_308',
 'log_report_type_309',
 'log_report_type_310',
 'log_report_type_311',
 'log_report_type_312',
 'log_report_type_313',
 'log_report_type_314',
 'log_report_type_315',
 'log_report_type_316',
 'log_report_type_318',
 'log_report_type_345',
 'log_report_type_349',
 'log_report_type_35',
 'log_report_type_353',
 'log_report_type_354',
 'log_report_type_360',
 'log_report_type_362',
 'log_report_type_368',
 'log_report_type_370',
 'log_report_type_374',
 'log_report_type_375',
 'log_report_type_376',
 'log_report_type_378',
 'log_report_type_38',
 'log_report_type_39',
 'log_report_type_42',
 'log_report_type_44',
 'log_report_type_47',
 'log_report_type_51',
 'log_report_type_52',
 'log_report_type_54',
 'log_report_type_55',
 'log_report_type_56',
 'log_report_type_62',
 'log_report_type_63',
 'log_report_type_66',
 'log_report_type_68',
 'log_report_type_70',
 'log_report_type_71',
 'log_report_type_73',
 'log_report_type_74',
 'log_report_type_75',
 'log_report_type_76',
 'log_report_type_8',
 'log_report_type_80',
 'log_report_type_81',
 'log_report_type_82',
 'log_report_type_83',
 'log_report_type_85',
 'log_report_type_86',
 'log_report_type_87',
 'log_report_type_94',
 'log_report_type_95',
 'outage_duration',
 'outage_type_1',
 'outage_type_2',
 'outage_type_3',
 'outage_type_4',
 'outage_type_5',
 'transit_server others_0',
 'transit_server others_1',
 'transit_server others_2',
 'transit_server_type_10',
 'transit_server_type_11',
 'transit_server_type_13',
 'transit_server_type_14',
 'transit_server_type_15',
 'transit_server_type_18',
 'transit_server_type_2',
 'transit_server_type_20',
 'transit_server_type_21',
 'transit_server_type_22',
 'transit_server_type_23',
 'transit_server_type_24',
 'transit_server_type_26',
 'transit_server_type_27',
 'transit_server_type_28',
 'transit_server_type_29',
 'transit_server_type_30',
 'transit_server_type_32',
 'transit_server_type_34',
 'transit_server_type_35',
 'transit_server_type_40',
 'transit_server_type_42',
 'transit_server_type_43',
 'transit_server_type_44',
 'transit_server_type_45',
 'transit_server_type_46',
 'transit_server_type_47',
 'transit_server_type_5',
 'transit_server_type_50',
 'transit_server_type_54',
 'transit_server_type_6',
 'transit_server_type_7',
 'transit_server_type_8']
In [173]:
# Drop the target from the numeric feature list (index 143 is 'outage_duration').
num_attr.pop(143)
Out[173]:
'outage_duration'
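Popping a hard-coded index is fragile: if a column is added or renamed, index 143 may no longer be the target. Removing the column by name is a safer equivalent (sketch on an illustrative subset of the feature list):

```python
# Dropping the target by name is robust to column reordering,
# unlike popping a hard-coded index.
num_attr = ["id", "outage_duration", "broadband_type_1"]  # illustrative subset
if "outage_duration" in num_attr:
    num_attr.remove("outage_duration")
print(num_attr)  # ['id', 'broadband_type_1']
```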

Label encoding the categorical feature so it can be passed to the model

In [174]:
lb = LabelEncoder()

Converting the area code to string data type for label encoding

In [175]:
for col in ['area_code']:
    train[col] = train[col].astype('str')
In [176]:
for col in ['area_code']:
    test[col] = test[col].astype('str')
In [177]:
train['area_code'] = lb.fit_transform(train['area_code'])
In [178]:
# NOTE: this fits a second encoder on the test set, so the same area can receive
# a different code than in train; reusing the train encoder via lb.transform is safer.
test['area_code'] = lb.fit_transform(test['area_code'])
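For the train/test encodings to agree, the encoder should be fit once on train and only applied (`transform`) to test. A minimal sketch on hypothetical area codes (note `transform` raises on labels unseen in train):

```python
from sklearn.preprocessing import LabelEncoder

train_codes = ["area_1", "area_2", "area_3"]
test_codes = ["area_2", "area_1"]

lb = LabelEncoder()
train_enc = lb.fit_transform(train_codes)  # learn the mapping on train
test_enc = lb.transform(test_codes)        # reuse the same mapping on test
print(train_enc.tolist(), test_enc.tolist())  # [0, 1, 2] [1, 0]
```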

Checking the size and missing values in data before passing through the model

In [179]:
train.shape, test.shape
Out[179]:
((5904, 186), (1477, 185))
In [180]:
train.head(2)
Out[180]:
id area_code outage_duration broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 broadband_type_6 ... transit_server_type_45 transit_server_type_46 transit_server_type_47 transit_server_type_5 transit_server_type_50 transit_server_type_54 transit_server_type_6 transit_server_type_7 transit_server_type_8 location_occurence_count
0 13366 382 1 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 6783 431 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

2 rows × 186 columns

In [181]:
test.head(2)
Out[181]:
id area_code broadband_type_1 broadband_type_10 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 broadband_type_6 broadband_type_7 ... transit_server_type_45 transit_server_type_46 transit_server_type_47 transit_server_type_5 transit_server_type_50 transit_server_type_54 transit_server_type_6 transit_server_type_7 transit_server_type_8 location_occurence_count
5904 3340 190 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 12
5905 14067 494 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

2 rows × 185 columns

In [182]:
train.isnull().sum()
Out[182]:
id                          0
area_code                   0
outage_duration             0
broadband_type_1            0
broadband_type_10           0
                           ..
transit_server_type_54      0
transit_server_type_6       0
transit_server_type_7       0
transit_server_type_8       0
location_occurence_count    0
Length: 186, dtype: int64

There are no missing values in the train data.

In [183]:
train.dtypes
Out[183]:
id                          int64
area_code                   int64
outage_duration             int64
broadband_type_1            int64
broadband_type_10           int64
                            ...  
transit_server_type_54      int64
transit_server_type_6       int64
transit_server_type_7       int64
transit_server_type_8       int64
location_occurence_count    int64
Length: 186, dtype: object

Model Building

Splitting the train dataset into train & validation

In [184]:
X, y = train.loc[:,train.columns!='outage_duration'], train.loc[:,'outage_duration']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify =y,random_state=3)
In [185]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[185]:
((4132, 185), (1772, 185), (4132,), (1772,))

1) Logistic Regression

In [186]:
# With unscaled features, lbfgs may hit the default max_iter=100 before
# converging; consider max_iter=1000 or scaling the inputs first.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Out[186]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [187]:
train_pred = logreg.predict(X_train)
test_pred = logreg.predict(X_test)
In [188]:
from sklearn import metrics  # needed for metrics.confusion_matrix / classification_report

print('Confusion matrix \n')
print(metrics.confusion_matrix(y_test, test_pred))
print('*'*80)
print('\n')
print('Classification report \n')
print(metrics.classification_report(y_train, train_pred))
print(metrics.classification_report(y_test, test_pred))
Confusion matrix 

[[1113    0   36]
 [ 432    0   17]
 [ 159    0   15]]
********************************************************************************


Classification report 

              precision    recall  f1-score   support

           0       0.66      0.98      0.79      2678
           1       0.50      0.00      0.00      1047
           2       0.30      0.11      0.16       407

    accuracy                           0.64      4132
   macro avg       0.48      0.36      0.32      4132
weighted avg       0.58      0.64      0.53      4132

              precision    recall  f1-score   support

           0       0.65      0.97      0.78      1149
           1       0.00      0.00      0.00       449
           2       0.22      0.09      0.12       174

    accuracy                           0.64      1772
   macro avg       0.29      0.35      0.30      1772
weighted avg       0.45      0.64      0.52      1772

In [189]:
model1_log = logreg.predict(test)
model1_log
Out[189]:
array([0, 0, 2, ..., 0, 0, 0])
In [190]:
final= pd.DataFrame({'ID' : test['id'],
                     'Outage_Duration' : model1_log})
In [191]:
final.to_csv("submission_log_1.csv",index=False)

2) Random Forest

In [192]:
rfc = RandomForestClassifier()
rfc.fit(X_train,y_train)
Out[192]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [193]:
train_pred = rfc.predict(X_train)
test_pred = rfc.predict(X_test)
In [194]:
print('Confusion matrix \n')
print(metrics.confusion_matrix(y_test,test_pred))
print('*'*80)
print('\n')
print('Classification report \n')
print(metrics.classification_report(y_train, train_pred))
print(metrics.classification_report(y_test,test_pred))
Confusion matrix 

[[1001  119   29]
 [ 209  184   56]
 [  40   54   80]]
********************************************************************************


Classification report 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2678
           1       1.00      1.00      1.00      1047
           2       1.00      1.00      1.00       407

    accuracy                           1.00      4132
   macro avg       1.00      1.00      1.00      4132
weighted avg       1.00      1.00      1.00      4132

              precision    recall  f1-score   support

           0       0.80      0.87      0.83      1149
           1       0.52      0.41      0.46       449
           2       0.48      0.46      0.47       174

    accuracy                           0.71      1772
   macro avg       0.60      0.58      0.59      1772
weighted avg       0.70      0.71      0.70      1772

In [195]:
model1_rf = rfc.predict(test)
model1_rf
Out[195]:
array([0, 0, 1, ..., 0, 1, 1])
In [196]:
final2= pd.DataFrame({'ID' : test['id'],
                     'Outage_Duration' : model1_rf})
In [197]:
final2.to_csv("submission_rf_1.csv",index=False)
In [198]:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
In [199]:
folds = KFold(n_splits= 5, shuffle= True, random_state= 101)
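Because the classes are imbalanced (roughly 65/25/10 in this data), `StratifiedKFold` is generally preferable to plain `KFold` here: it keeps the class proportions in every fold. A sketch with hypothetical labels mimicking that split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels mimicking the ~65/25/10 class split (hypothetical).
y = np.array([0] * 65 + [1] * 25 + [2] * 10)
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
for _, val_idx in skf.split(X, y):
    # Each validation fold keeps the class proportions of the full data.
    print(np.bincount(y[val_idx]))  # [13  5  2] in every fold
```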
In [200]:
params = {'max_depth':[1, 2, 3, 5],
         'n_estimators':[500,550,600,650],
         'min_samples_leaf': [100, 150, 200, 250], 
         'min_samples_split': [150, 200, 250, 300],
         'class_weight':['balanced']}

rf = RandomForestClassifier()
In [201]:
rf_fin = GridSearchCV(estimator= rf, cv = folds, param_grid= params, scoring= 'f1_macro', return_train_score= True)
In [202]:
rf_fin.fit(X_train,y_train)
Out[202]:
GridSearchCV(cv=KFold(n_splits=5, random_state=101, shuffle=True),
             error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_f...
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'class_weight': ['balanced'],
                         'max_depth': [1, 2, 3, 5],
                         'min_samples_leaf': [100, 150, 200, 250],
                         'min_samples_split': [150, 200, 250, 300],
                         'n_estimators': [500, 550, 600, 650]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='f1_macro', verbose=0)
In [203]:
scores = rf_fin.cv_results_

scores = pd.DataFrame(scores)

scores.head()
Out[203]:
mean_fit_time std_fit_time mean_score_time std_score_time param_class_weight param_max_depth param_min_samples_leaf param_min_samples_split param_n_estimators params ... mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score
0 1.015729 0.051704 0.070637 0.009324 balanced 1 100 150 500 {'class_weight': 'balanced', 'max_depth': 1, '... ... 0.488526 0.020805 49 0.484167 0.490745 0.491622 0.495505 0.484583 0.489324 0.004349
1 1.239734 0.189969 0.078926 0.014421 balanced 1 100 150 550 {'class_weight': 'balanced', 'max_depth': 1, '... ... 0.488464 0.019076 50 0.482463 0.495270 0.488284 0.490775 0.485739 0.488506 0.004362
2 1.491522 0.153474 0.100334 0.037640 balanced 1 100 150 600 {'class_weight': 'balanced', 'max_depth': 1, '... ... 0.488299 0.024109 56 0.492697 0.499189 0.489900 0.480470 0.485411 0.489533 0.006364
3 1.317365 0.140765 0.098071 0.017055 balanced 1 100 150 650 {'class_weight': 'balanced', 'max_depth': 1, '... ... 0.487542 0.019858 70 0.480453 0.492821 0.491172 0.491102 0.484855 0.488081 0.004685
4 0.933253 0.098457 0.065336 0.004568 balanced 1 100 200 500 {'class_weight': 'balanced', 'max_depth': 1, '... ... 0.486616 0.020903 79 0.490207 0.497805 0.482106 0.489270 0.488165 0.489510 0.005020

5 rows × 25 columns

In [204]:
print('The best score was achieved using the parameters: {}'.format(rf_fin.best_params_))
The best score was achieved using the parameters: {'class_weight': 'balanced', 'max_depth': 5, 'min_samples_leaf': 100, 'min_samples_split': 150, 'n_estimators': 500}
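Rather than retyping `best_params_` into a fresh model by hand (which risks dropping a parameter, as happens with `class_weight` in the next cell), the refit model is already available as `best_estimator_` on any fitted `GridSearchCV`. A self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_classes=3, n_informative=4,
                           random_state=0)
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 4]}, cv=3, scoring="f1_macro")
grid.fit(X, y)

# best_estimator_ is already refit on the full training data with best_params_,
# so none of the tuned parameters can be accidentally dropped.
best_model = grid.best_estimator_
print(best_model.get_params()["max_depth"] in (2, 4))  # True
```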
In [205]:
# NOTE: best_params_ included class_weight='balanced', which is omitted here;
# with such large leaf/split minimums the model collapses to always predicting
# the majority class, as the reports below show.
random_final = RandomForestClassifier(max_depth= 5,
                                      min_samples_leaf= 100,
                                      min_samples_split= 150,
                                      n_estimators= 500)
In [206]:
random_final.fit(X_train,y_train)
Out[206]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=100, min_samples_split=150,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [207]:
train_pred = random_final.predict(X_train)
test_pred = random_final.predict(X_test)
In [208]:
print('Confusion matrix \n')
print(metrics.confusion_matrix(y_test,test_pred))
print('*'*80)
print('\n')
print('Classification report \n')
print(metrics.classification_report(y_train, train_pred))
print(metrics.classification_report(y_test,test_pred))
Confusion matrix 

[[1149    0    0]
 [ 449    0    0]
 [ 174    0    0]]
********************************************************************************


Classification report 

              precision    recall  f1-score   support

           0       0.65      1.00      0.79      2678
           1       0.00      0.00      0.00      1047
           2       0.00      0.00      0.00       407

    accuracy                           0.65      4132
   macro avg       0.22      0.33      0.26      4132
weighted avg       0.42      0.65      0.51      4132

              precision    recall  f1-score   support

           0       0.65      1.00      0.79      1149
           1       0.00      0.00      0.00       449
           2       0.00      0.00      0.00       174

    accuracy                           0.65      1772
   macro avg       0.22      0.33      0.26      1772
weighted avg       0.42      0.65      0.51      1772

3) Decision Tree

In [209]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
Out[209]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
In [210]:
dt.get_params().keys()
Out[210]:
dict_keys(['ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'presort', 'random_state', 'splitter'])
In [211]:
dt_params = {'criterion': ['entropy', 'gini'],
             'max_depth': [6,8,9,10],
             'min_samples_split': [2, 3, 4],
             'min_samples_leaf': [1, 2, 3],
             'class_weight':['balanced']}
In [212]:
dt_grid = GridSearchCV(dt, param_grid=dt_params, cv=10, scoring = 'f1_macro')
In [213]:
%%time
dt_grid.fit(X_train,y_train)
CPU times: user 20.8 s, sys: 250 ms, total: 21 s
Wall time: 21 s
Out[213]:
GridSearchCV(cv=10, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'class_weight': ['balanced'],
                         'criterion': ['entropy', 'gini'],
                         'max_depth': [6, 8, 9, 10],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 4]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1_macro', verbose=0)
In [214]:
dt_grid.best_params_
Out[214]:
{'class_weight': 'balanced',
 'criterion': 'gini',
 'max_depth': 9,
 'min_samples_leaf': 1,
 'min_samples_split': 3}
In [215]:
train_pred = dt_grid.predict(X_train)
test_pred = dt_grid.predict(X_test)
In [216]:
print(dt_grid.score(X_train, y_train))
print(dt_grid.score(X_test, y_test))
print("\n")
print(confusion_matrix(y_true=y_train, y_pred = train_pred))

confusion_matrix_test = confusion_matrix(y_true=y_test, y_pred =  test_pred)
confusion_matrix_test
0.6826414445817504
0.6124912771097852


[[2105  422  151]
 [ 274  614  159]
 [   9   58  340]]
Out[216]:
array([[899, 188,  62],
       [133, 230,  86],
       [ 11,  51, 112]])
In [217]:
print('Classification report \n')
print(metrics.classification_report(y_train, train_pred))
print(metrics.classification_report(y_test,test_pred))
Classification report 

              precision    recall  f1-score   support

           0       0.88      0.79      0.83      2678
           1       0.56      0.59      0.57      1047
           2       0.52      0.84      0.64       407

    accuracy                           0.74      4132
   macro avg       0.66      0.74      0.68      4132
weighted avg       0.77      0.74      0.75      4132

              precision    recall  f1-score   support

           0       0.86      0.78      0.82      1149
           1       0.49      0.51      0.50       449
           2       0.43      0.64      0.52       174

    accuracy                           0.70      1772
   macro avg       0.59      0.65      0.61      1772
weighted avg       0.73      0.70      0.71      1772

In [218]:
model3_dt = dt_grid.predict(test)
model3_dt
Out[218]:
array([0, 0, 1, ..., 0, 1, 1])
In [219]:
final3= pd.DataFrame({'ID' : test['id'],
                     'Outage_Duration' : model3_dt})
In [220]:
final3.to_csv("submission_dt.csv",index=False)

4) Gradient Boosting

In [221]:
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier()
gbm.fit(X_train, y_train)
Out[221]:
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
In [222]:
gbm_params = {'max_depth': [4,5,6],
          'subsample': [0.8, 0.3],
          'max_features':[0.4,0.5],
          'n_estimators': [40,50]}
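Unlike the tree models above, `GradientBoostingClassifier` has no `class_weight` parameter; the class imbalance can instead be handled by passing per-sample weights to `fit`. A sketch of the 'balanced' weighting (each class gets weight n_samples / (n_classes * class_count)):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels: class 0 appears 4x, class 1 2x, class 2 once.
y = np.array([0, 0, 0, 0, 1, 1, 2])
w = compute_sample_weight("balanced", y)
print(np.round(w, 3))  # rarer classes receive larger weights

# A GradientBoostingClassifier would then be fit with, e.g.:
# gbm.fit(X_train, y_train,
#         sample_weight=compute_sample_weight("balanced", y_train))
```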
In [223]:
gbm_grid = GridSearchCV(gbm, param_grid= gbm_params, cv = 10, scoring= 'f1_macro', return_train_score= True)
In [422]:
%%time
gbm_grid.fit(X_train,y_train)
CPU times: user 4min 53s, sys: 30.8 ms, total: 4min 53s
Wall time: 4min 53s
Out[422]:
GridSearchCV(cv=10, estimator=GradientBoostingClassifier(),
             param_grid={'max_depth': [4, 5, 6], 'max_features': [0.4, 0.5],
                         'n_estimators': [40, 50], 'subsample': [0.8, 0.3]},
             return_train_score=True, scoring='f1_macro')
In [423]:
gbm_grid.best_params_
Out[423]:
{'max_depth': 6, 'max_features': 0.5, 'n_estimators': 50, 'subsample': 0.8}
In [424]:
train_pred = gbm_grid.predict(X_train)
test_pred = gbm_grid.predict(X_test)
In [425]:
print(gbm_grid.score(X_train, y_train))
print(gbm_grid.score(X_test, y_test))
print("\n")
print(confusion_matrix(y_true=y_train, y_pred = train_pred))

confusion_matrix_test = confusion_matrix(y_true=y_test, y_pred =  test_pred)
confusion_matrix_test
0.7702218676751489
0.5971990078742687


[[2718  101   51]
 [ 394  619  109]
 [  43   36  357]]
Out[425]:
array([[841,  77,  39],
       [180, 138,  56],
       [ 33,  30,  82]])
In [426]:
from sklearn import metrics

print('Classification report\n')
print(metrics.classification_report(y_train, train_pred))
print(metrics.classification_report(y_test, test_pred))
Classification report 

              precision    recall  f1-score   support

           0       0.86      0.95      0.90      2870
           1       0.82      0.55      0.66      1122
           2       0.69      0.82      0.75       436

    accuracy                           0.83      4428
   macro avg       0.79      0.77      0.77      4428
weighted avg       0.83      0.83      0.83      4428

              precision    recall  f1-score   support

           0       0.80      0.88      0.84       957
           1       0.56      0.37      0.45       374
           2       0.46      0.57      0.51       145

    accuracy                           0.72      1476
   macro avg       0.61      0.60      0.60      1476
weighted avg       0.71      0.72      0.71      1476

In [427]:
model4_gbm = gbm_grid.predict(test)
model4_gbm
Out[427]:
array([0, 0, 0, ..., 0, 0, 0])
In [428]:
final4 = pd.DataFrame({'id': test['id'],
                       'outage_duration': model4_gbm})
In [244]:
final4.to_csv("submission_gbm.csv",index=False)

5) XGBoost

In [429]:
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_train, y_train)
Out[429]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [430]:
xgb_params = {'n_estimators': [150],
              'learning_rate': [0.05, 0.10],
              'colsample_bytree': [0.5, 0.1],
              'reg_alpha': [0.5, 1.0],
              'reg_lambda': [0.5, 1.1],
              'subsample': [0.5, 0.6, 0.7],
              'max_depth': [6, 8, 10]}
In [431]:
xgb_grid = GridSearchCV(xgb, param_grid= xgb_params, cv = 10, scoring= 'f1_macro')
In [432]:
%%time
xgb_grid.fit(X_train,y_train)
CPU times: user 5h 42min 2s, sys: 1min 20s, total: 5h 43min 23s
Wall time: 1h 27min 50s
Out[432]:
GridSearchCV(cv=10,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0, gpu_id=-1,
                                     importance_type='gain',
                                     interaction_constraints='',
                                     learning_rate=0.300000012,
                                     max_delta_step=0, max_depth=6,
                                     min_child_weight=1, missing=nan,
                                     monotone_constraints='()',
                                     n_estimators=100, n_jobs=0,
                                     num_paralle...
                                     objective='multi:softprob', random_state=0,
                                     reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=None, subsample=1,
                                     tree_method='exact', validate_parameters=1,
                                     verbosity=None),
             param_grid={'colsample_bytree': [0.5, 0.1],
                         'learning_rate': [0.05, 0.1], 'max_depth': [6, 8, 10],
                         'n_estimators': [150], 'reg_alpha': [0.5, 1.0],
                         'reg_lambda': [0.5, 1.1],
                         'subsample': [0.5, 0.6, 0.7]},
             scoring='f1_macro')
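The grid above spans 144 parameter combinations × 10 folds = 1,440 fits, which explains the ~1.5-hour wall time. `RandomizedSearchCV` (already imported) caps the budget at a fixed number of sampled candidates instead. A sketch on synthetic data, using `RandomForestClassifier` as a lightweight stand-in for `XGBClassifier`:

```python
# Sketch, assuming synthetic data and RandomForest as a lightweight stand-in
# for XGBClassifier: n_iter bounds the number of sampled candidates.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={'max_depth': randint(4, 11),        # samples 4..10
                         'max_features': uniform(0.3, 0.4),  # samples 0.3..0.7
                         'n_estimators': randint(40, 160)},
    n_iter=8, cv=3, scoring='f1_macro', random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

With `n_iter=8` and `cv=10` this would cost 80 fits instead of 1,440, usually at little loss in the best score found.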
In [433]:
train_pred = xgb_grid.predict(X_train)
test_pred = xgb_grid.predict(X_test)
In [434]:
print(xgb_grid.score(X_train, y_train))
print(xgb_grid.score(X_test, y_test))
print("\n")
print(confusion_matrix(y_true = y_train, y_pred = train_pred))

confusion_matrix_test = confusion_matrix(y_true=y_test, y_pred =  test_pred)
confusion_matrix_test
0.8552587381157509
0.606516112828442


[[2759   85   26]
 [ 262  816   44]
 [  37   32  367]]
Out[434]:
array([[832,  94,  31],
       [168, 167,  39],
       [ 29,  46,  70]])
In [435]:
print('Classification report \n')
print(metrics.classification_report(y_train, train_pred))
print(metrics.classification_report(y_test,test_pred))
Classification report 

              precision    recall  f1-score   support

           0       0.90      0.96      0.93      2870
           1       0.87      0.73      0.79      1122
           2       0.84      0.84      0.84       436

    accuracy                           0.89      4428
   macro avg       0.87      0.84      0.86      4428
weighted avg       0.89      0.89      0.89      4428

              precision    recall  f1-score   support

           0       0.81      0.87      0.84       957
           1       0.54      0.45      0.49       374
           2       0.50      0.48      0.49       145

    accuracy                           0.72      1476
   macro avg       0.62      0.60      0.61      1476
weighted avg       0.71      0.72      0.72      1476

In [436]:
test.dtypes
Out[436]:
id                        int64
area_code                 int64
broadband_type_1          int64
broadband_type_10         int64
broadband_type_2          int64
                          ...  
transit_server_type_50    int64
transit_server_type_54    int64
transit_server_type_6     int64
transit_server_type_7     int64
transit_server_type_8     int64
Length: 184, dtype: object
In [437]:
model5_xgb = xgb_grid.predict(test)
model5_xgb
Out[437]:
array([0, 0, 0, ..., 0, 0, 0])
In [440]:
final5= pd.DataFrame({'id' : test['id'],
                     'outage_duration' : model5_xgb})
In [439]:
final5.to_csv("submission_xgb2.csv",index=False)

Understanding the important features

In [224]:
dt.feature_importances_
Out[224]:
array([2.07817225e-01, 1.74233500e-01, 3.30149984e-03, 0.00000000e+00,
       4.39648943e-03, 1.37044914e-03, 3.92632902e-03, 0.00000000e+00,
       7.17370909e-04, 8.23595313e-03, 2.12374603e-03, 0.00000000e+00,
       4.64186410e-02, 2.77590451e-03, 0.00000000e+00, 2.86948363e-04,
       2.05603473e-03, 9.53539808e-03, 1.32688138e-02, 5.46834796e-03,
       0.00000000e+00, 1.61920412e-03, 0.00000000e+00, 0.00000000e+00,
       6.98662800e-04, 0.00000000e+00, 1.33504258e-03, 3.07708781e-03,
       4.78247272e-04, 0.00000000e+00, 7.88797449e-04, 2.76497851e-03,
       3.44540716e-03, 1.34148003e-03, 1.85985050e-04, 4.78247272e-04,
       6.82196177e-04, 2.35155256e-02, 1.65950062e-03, 3.20320559e-03,
       2.88800277e-03, 4.78247272e-04, 6.37663030e-04, 0.00000000e+00,
       0.00000000e+00, 6.43451544e-03, 6.57629315e-04, 1.58178606e-03,
       2.83585394e-03, 1.47519718e-03, 0.00000000e+00, 8.75901737e-03,
       2.30900550e-03, 1.08567412e-01, 7.17370909e-04, 1.03557509e-03,
       6.50913700e-04, 1.09339196e-03, 7.23320884e-03, 0.00000000e+00,
       0.00000000e+00, 1.64281044e-03, 0.00000000e+00, 4.24539746e-03,
       0.00000000e+00, 0.00000000e+00, 8.19852467e-04, 5.06730610e-03,
       3.94135680e-03, 0.00000000e+00, 2.91737345e-03, 3.07052025e-05,
       0.00000000e+00, 2.85597209e-03, 2.78332735e-03, 0.00000000e+00,
       6.02447310e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       5.63488412e-03, 0.00000000e+00, 1.72637540e-03, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 4.78247272e-04, 5.14692853e-03,
       0.00000000e+00, 8.19852467e-04, 1.79480155e-03, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 3.04867062e-03, 0.00000000e+00,
       0.00000000e+00, 3.18831515e-04, 0.00000000e+00, 4.98812981e-03,
       4.96999741e-03, 3.40799258e-03, 2.48664412e-03, 0.00000000e+00,
       8.15472913e-04, 1.52925259e-03, 0.00000000e+00, 6.87136886e-04,
       0.00000000e+00, 4.78247272e-04, 1.84249079e-03, 1.26105134e-03,
       0.00000000e+00, 1.21205891e-03, 6.69546181e-04, 0.00000000e+00,
       0.00000000e+00, 9.13017520e-04, 0.00000000e+00, 2.34832302e-03,
       4.78247272e-04, 3.27328207e-03, 0.00000000e+00, 7.38830722e-04,
       0.00000000e+00, 8.44115692e-03, 2.54908152e-04, 1.26166185e-03,
       7.62918268e-04, 0.00000000e+00, 7.53808796e-04, 6.63069706e-03,
       2.13338135e-03, 4.61898654e-03, 1.26949758e-02, 2.11287192e-03,
       1.77242052e-03, 0.00000000e+00, 1.72305271e-03, 1.14418717e-02,
       3.06229382e-03, 1.49014270e-03, 0.00000000e+00, 1.12388109e-03,
       8.62121839e-04, 1.47563092e-03, 0.00000000e+00, 0.00000000e+00,
       1.92361681e-03, 0.00000000e+00, 0.00000000e+00, 1.59415757e-04,
       6.09278134e-03, 7.97078787e-04, 4.24422687e-03, 2.17437529e-03,
       0.00000000e+00, 0.00000000e+00, 8.54813143e-04, 0.00000000e+00,
       9.42479973e-04, 7.16635144e-04, 1.51357379e-04, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 9.40609428e-03, 2.23182060e-03, 6.88903809e-04,
       5.30396921e-03, 0.00000000e+00, 0.00000000e+00, 1.18423134e-03,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       3.07760193e-03, 2.39123636e-04, 8.21554742e-04, 0.00000000e+00,
       1.41837268e-01])
In [225]:
features = train.columns
importances = dt.feature_importances_
indices = np.argsort(importances)[::-1] # np.argsort returns the indices that would sort an array.
pd.DataFrame([train.columns[indices],np.sort(importances)[::-1]])
Out[225] (2 rows × 185 columns; transposed for readability):

id                       0.207817
area_code                0.174233
transit_server_type_8    0.141837
log_report_type_202      0.108567
broadband_type_9         0.046419
log_report_type_167      0.023516
log_report_others_0      0.013269
log_report_type_71       0.012695
log_report_type_8        0.011442
outage_type_5            0.009535
...
log_report_type_315      0.000000
log_report_type_188      0.000000
log_report_type_182      0.000000
log_report_type_345      0.000000
log_report_type_35       0.000000
log_report_type_362      0.000000
log_report_type_374      0.000000
log_report_type_375      0.000000
log_report_type_378      0.000000
log_report_type_304      0.000000
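A sorted pandas Series is often an easier way to scan importances than a wide two-row DataFrame. A sketch on a synthetic stand-in for the project's fitted tree (named `dt_demo` so it does not clobber the notebook's `dt`):

```python
# Sketch, assuming a synthetic stand-in for the project's fitted tree:
# a sorted Series reads more easily than a wide two-row DataFrame.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
cols = ['feat_%d' % i for i in range(8)]
dt_demo = DecisionTreeClassifier(random_state=0).fit(X, y)
imp = (pd.Series(dt_demo.feature_importances_, index=cols)
         .sort_values(ascending=False))
print(imp.head(5))
```

Tree importances are normalized, so the Series sums to 1 and the head shows the share of impurity reduction each top feature contributes.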

In [226]:
from sklearn.datasets import make_classification
from matplotlib import pyplot

# Standalone demo: feature_importances_ of a decision tree on synthetic data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

dt = DecisionTreeClassifier()

dt.fit(X, y)

importance = dt.feature_importances_

for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))

pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Feature: 0, Score: 0.01238
Feature: 1, Score: 0.00362
Feature: 2, Score: 0.18877
Feature: 3, Score: 0.30626
Feature: 4, Score: 0.08508
Feature: 5, Score: 0.01361
Feature: 6, Score: 0.17137
Feature: 7, Score: 0.05657
Feature: 8, Score: 0.13062
Feature: 9, Score: 0.03171
In [227]:
select = indices[0:185]  # all 185 features, ranked by importance
dt = tree.DecisionTreeClassifier(max_depth=3)
dt = dt.fit(X_train.iloc[:, select], y_train)
In [228]:
train_pred = dt.predict(X_train.iloc[:,select])
test_pred = dt.predict(X_test.iloc[:,select])
In [229]:
print('Classification report \n')
print(metrics.classification_report(y_train, train_pred))
print(metrics.classification_report(y_test,test_pred))
Classification report 

              precision    recall  f1-score   support

           0       0.76      0.91      0.83      2678
           1       0.48      0.27      0.35      1047
           2       0.52      0.44      0.48       407

    accuracy                           0.70      4132
   macro avg       0.59      0.54      0.55      4132
weighted avg       0.67      0.70      0.67      4132

              precision    recall  f1-score   support

           0       0.76      0.93      0.84      1149
           1       0.45      0.25      0.32       449
           2       0.46      0.33      0.38       174

    accuracy                           0.70      1772
   macro avg       0.56      0.50      0.51      1772
weighted avg       0.65      0.70      0.66      1772

In [230]:
model6_dt2 = dt.predict(test)
model6_dt2
Out[230]:
array([0, 0, 0, ..., 0, 0, 0])
In [231]:
final6 = pd.DataFrame({'id': test['id'],
                       'outage_duration': model6_dt2})
In [232]:
final6.to_csv("submission_dt2.csv",index=False)
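As an alternative to slicing columns by `argsort` indices, scikit-learn's `SelectFromModel` performs the same importance-based selection against a threshold. A sketch on synthetic data (not the project dataset):

```python
# Sketch, assuming synthetic data: SelectFromModel keeps features whose
# importance clears a threshold (default: the mean importance).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
selector = SelectFromModel(DecisionTreeClassifier(random_state=0)).fit(X, y)
X_sel = selector.transform(X)
print(X.shape, '->', X_sel.shape)
```

The same `selector.transform` can then be applied to the test frame, which avoids the risk of the manual `iloc` indices drifting out of sync between train and test.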

Visualization of outage_duration in the test data

In [233]:
import seaborn as sns

sns.distplot(y_test)
Out[233]:
<matplotlib.axes._subplots.AxesSubplot at 0x12b2820d0>
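`distplot` is deprecated in recent seaborn releases, and for a discrete 3-class target, normalized value counts give the same information more directly than a KDE-smoothed histogram. A sketch using the class supports from the test-set classification report above (957 / 374 / 145):

```python
# Sketch, rebuilt from the test-set class supports reported above
# (957 / 374 / 145): value_counts(normalize=True) summarizes a discrete
# 3-class target more directly than a KDE-smoothed distplot.
import pandas as pd

labels = pd.Series([0] * 957 + [1] * 374 + [2] * 145)
print(labels.value_counts(normalize=True).round(3))
```

This makes the class imbalance explicit: about 65% of test rows have no outage, which is why the macro-averaged F1 lags the plain accuracy.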

FINAL MODEL: As seen above, the XGBoost model gives the best results, with a macro-averaged F1 score of 85.5% on the train data and 60.7% on the test data.
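For reference, the scores quoted above are macro-averaged F1, i.e. the unweighted mean of the per-class F1 scores, so the minority outage classes count as much as the no-outage majority. A toy check with hypothetical labels:

```python
# Sketch with hypothetical toy labels: macro F1 is the unweighted mean of the
# per-class F1 scores (here 0.8, 0.667 and 1.0).
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
print(round(f1_score(y_true, y_pred, average='macro'), 3))  # → 0.822
```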